Title: "IST707 HW5: Use a Decision Tree to Solve a Mystery in History"
Name: Sathish Kumar Rajendiran
Date: 08/12/2020

Exercise: Use a Decision Tree to Solve a Mystery in History: who wrote the disputed essays, Hamilton or Madison?

In this homework assignment, you are going to use the decision tree algorithm to solve the disputed essay problem. Last week you used clustering techniques to tackle this problem.

Organize your report using the following template:

Section 1: Data preparation. You will need to separate the original data set into training and testing data for the classification experiments. Describe which examples are in your training data and which are in your test data.

Section 2: Build and tune decision tree models. First build a DT model using the default settings, then tune the parameters to see whether a better model can be generated. Compare these models using appropriate evaluation measures, and describe and compare the patterns learned by them.

Section 3: Prediction. After building the classification model, apply it to the disputed papers to determine their authorship. Does the DT model reach the same conclusion as the clustering algorithms did?

# Import libraries
# Helper: install a package if it is missing, then load it
EnsurePackage <- function(x) {
  x <- as.character(x)
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.us.r-project.org")
    require(x, character.only = TRUE)
  }
}
# Usage example: load the libraries needed for further processing
EnsurePackage("ggplot2")
Loading required package: ggplot2
EnsurePackage("RColorBrewer")
Loading required package: RColorBrewer
EnsurePackage("NbClust")
Loading required package: NbClust
EnsurePackage("caret")
Loading required package: caret
Loading required package: lattice
EnsurePackage("rpart")
Loading required package: rpart
EnsurePackage("rpart.plot")
Loading required package: rpart.plot
EnsurePackage("randomForest")
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:ggplot2’:

    margin
cat("All Packages are available")
All Packages are available
#Load CSV into data frame
  filepath <- "/Users/sathishrajendiran/Documents/R/fedPapers85.csv"
  # note: stringsAsFactors = FALSE has no effect here, since read.csv has
  # already converted the character columns (author, filename) to factors
  fedPapersDF <- data.frame(read.csv(filepath,na.strings=c(""," ","NA")),stringsAsFactors=FALSE)
  dim(fedPapersDF) #85 72
[1] 85 72
# Preview the structure 
  str(fedPapersDF)
'data.frame':   85 obs. of  72 variables:
 $ author  : Factor w/ 5 levels "dispt","Hamilton",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ filename: Factor w/ 85 levels "dispt_fed_49.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ a       : num  0.28 0.177 0.339 0.27 0.303 0.245 0.349 0.414 0.248 0.442 ...
 $ all     : num  0.052 0.063 0.09 0.024 0.054 0.059 0.036 0.083 0.04 0.062 ...
 $ also    : num  0.009 0.013 0.008 0.016 0.027 0.007 0.007 0.009 0.007 0.006 ...
 $ an      : num  0.096 0.038 0.03 0.024 0.034 0.067 0.029 0.018 0.04 0.075 ...
 $ and     : num  0.358 0.393 0.301 0.262 0.404 0.282 0.335 0.478 0.356 0.423 ...
 $ any     : num  0.026 0.063 0.008 0.056 0.04 0.052 0.058 0.046 0.034 0.037 ...
 $ are     : num  0.131 0.051 0.068 0.064 0.128 0.111 0.087 0.11 0.154 0.093 ...
 $ as      : num  0.122 0.139 0.203 0.111 0.148 0.252 0.073 0.074 0.161 0.1 ...
 $ at      : num  0.017 0.114 0.023 0.056 0.013 0.015 0.116 0.037 0.047 0.031 ...
 $ be      : num  0.411 0.393 0.474 0.365 0.344 0.297 0.378 0.331 0.289 0.379 ...
 $ been    : num  0.026 0.165 0.015 0.127 0.047 0.03 0.044 0.046 0.027 0.025 ...
 $ but     : num  0.009 0 0.038 0.032 0.061 0.037 0.007 0.055 0.027 0.037 ...
 $ by      : num  0.14 0.139 0.173 0.167 0.209 0.186 0.102 0.092 0.168 0.174 ...
 $ can     : num  0.035 0 0.023 0.056 0.088 0 0.058 0.037 0.047 0.056 ...
 $ do      : num  0.026 0.013 0 0 0 0 0.015 0.028 0 0 ...
 $ down    : num  0 0 0.008 0 0 0.007 0 0 0 0 ...
 $ even    : num  0.009 0.025 0.015 0.024 0.02 0.007 0.007 0.018 0 0.006 ...
 $ every   : num  0.044 0 0.023 0.04 0.027 0.007 0.087 0.064 0.081 0.05 ...
 $ for.    : num  0.096 0.076 0.098 0.103 0.141 0.067 0.116 0.055 0.127 0.1 ...
 $ from    : num  0.044 0.101 0.053 0.079 0.074 0.096 0.08 0.083 0.074 0.124 ...
 $ had     : num  0.035 0.101 0.008 0.016 0 0.022 0.015 0.009 0.007 0 ...
 $ has     : num  0.017 0.013 0.015 0.024 0.054 0.015 0.036 0.037 0.02 0.019 ...
 $ have    : num  0.044 0.152 0.023 0.143 0.047 0.119 0.044 0.074 0.074 0.044 ...
 $ her     : num  0 0 0 0 0 0 0.007 0 0.034 0.025 ...
 $ his     : num  0.017 0 0 0.024 0.02 0.067 0 0.018 0.02 0.05 ...
 $ if.     : num  0 0.025 0.023 0.04 0.034 0.03 0.029 0 0 0.025 ...
 $ in.     : num  0.262 0.291 0.308 0.238 0.263 0.401 0.189 0.267 0.248 0.274 ...
 $ into    : num  0.009 0.025 0.038 0.008 0.013 0.037 0 0.037 0.013 0.037 ...
 $ is      : num  0.157 0.038 0.15 0.151 0.189 0.26 0.167 0.083 0.208 0.23 ...
 $ it      : num  0.175 0.127 0.173 0.222 0.108 0.156 0.102 0.165 0.134 0.131 ...
 $ its     : num  0.07 0.038 0.03 0.048 0.013 0.015 0 0.046 0.02 0.019 ...
 $ may     : num  0.035 0.038 0.12 0.056 0.047 0.074 0.08 0.092 0.027 0.106 ...
 $ more    : num  0.026 0 0.038 0.056 0.067 0.045 0.08 0.064 0.06 0.081 ...
 $ must    : num  0.026 0.013 0.083 0.071 0.013 0.015 0.044 0.018 0.027 0.068 ...
 $ my      : num  0 0 0 0 0 0 0.007 0 0 0 ...
 $ no      : num  0.035 0 0.03 0.032 0.047 0.059 0.022 0.018 0.02 0.044 ...
 $ not     : num  0.114 0.127 0.068 0.087 0.128 0.134 0.102 0.101 0.094 0.106 ...
 $ now     : num  0 0 0 0 0 0 0.007 0 0.007 0.012 ...
 $ of      : num  0.9 0.747 0.858 0.802 0.869 ...
 $ on      : num  0.14 0.139 0.15 0.143 0.054 0.141 0.051 0.083 0.127 0.118 ...
 $ one     : num  0.026 0.025 0.03 0.032 0.047 0.052 0.073 0.046 0.06 0.031 ...
 $ only    : num  0.035 0 0.023 0.048 0.027 0.022 0.007 0.046 0.02 0.012 ...
 $ or      : num  0.096 0.114 0.06 0.064 0.081 0.074 0.153 0.037 0.154 0.081 ...
 $ our     : num  0.017 0 0 0.016 0.027 0.03 0.051 0 0.007 0.025 ...
 $ shall   : num  0.017 0 0.008 0.016 0 0.015 0.007 0 0.02 0 ...
 $ should  : num  0.017 0.013 0.068 0.032 0 0.03 0.007 0 0 0.012 ...
 $ so      : num  0.035 0.013 0.038 0.04 0.027 0.007 0.051 0.018 0.04 0.05 ...
 $ some    : num  0.009 0.063 0.03 0.024 0.067 0.045 0.007 0.028 0.027 0.025 ...
 $ such    : num  0.026 0 0.045 0.008 0.027 0.015 0.015 0 0.013 0.031 ...
 $ than    : num  0.009 0 0.023 0 0.047 0.03 0.109 0.055 0.067 0.044 ...
 $ that    : num  0.184 0.152 0.188 0.238 0.162 0.208 0.233 0.165 0.208 0.218 ...
 $ the     : num  1.43 1.25 1.49 1.33 1.19 ...
 $ their   : num  0.114 0.165 0.053 0.071 0.027 0.089 0.109 0.083 0.154 0.081 ...
 $ then    : num  0 0 0.015 0.008 0.007 0.007 0.015 0.009 0.007 0.012 ...
 $ there   : num  0.009 0 0.015 0 0.007 0.007 0.036 0.028 0.02 0 ...
 $ things  : num  0.009 0 0 0 0 0 0 0 0 0.012 ...
 $ this    : num  0.044 0.051 0.075 0.103 0.094 0.126 0.08 0.11 0.067 0.093 ...
 $ to      : num  0.507 0.355 0.361 0.532 0.485 0.445 0.56 0.34 0.49 0.498 ...
 $ up      : num  0 0 0 0 0 0 0.007 0 0 0 ...
 $ upon    : num  0 0.013 0 0 0 0 0 0 0 0 ...
 $ was     : num  0.009 0.051 0.008 0.087 0.027 0.007 0.015 0.018 0.027 0 ...
 $ were    : num  0.017 0 0.015 0.079 0.02 0.03 0.029 0.009 0.007 0 ...
 $ what    : num  0 0 0.008 0.008 0.02 0.015 0.015 0.009 0.02 0.025 ...
 $ when    : num  0.009 0 0 0.024 0.007 0.037 0.007 0 0.02 0.012 ...
 $ which   : num  0.175 0.114 0.105 0.167 0.155 0.186 0.211 0.175 0.201 0.199 ...
 $ who     : num  0.044 0.038 0.008 0 0.027 0.045 0.022 0.018 0.04 0.031 ...
 $ will    : num  0.009 0.089 0.173 0.079 0.168 0.111 0.145 0.267 0.154 0.106 ...
 $ with    : num  0.087 0.063 0.045 0.079 0.074 0.089 0.073 0.129 0.027 0.081 ...
 $ would   : num  0.192 0.139 0.068 0.064 0.04 0.037 0.073 0.037 0.04 0.031 ...
 $ your    : num  0 0 0 0 0 0 0 0 0 0 ...
# Analyze the spread  
  summary(fedPapersDF)
      author               filename        a               all               also                an               and              any               are         
 dispt   :11   dispt_fed_49.txt: 1   Min.   :0.0960   Min.   :0.01500   Min.   :0.000000   Min.   :0.00900   Min.   :0.2170   Min.   :0.00000   Min.   :0.01300  
 Hamilton:51   dispt_fed_50.txt: 1   1st Qu.:0.2400   1st Qu.:0.03500   1st Qu.:0.000000   1st Qu.:0.04900   1st Qu.:0.3190   1st Qu.:0.02500   1st Qu.:0.05100  
 HM      : 3   dispt_fed_51.txt: 1   Median :0.2990   Median :0.05000   Median :0.007000   Median :0.07100   Median :0.3580   Median :0.04300   Median :0.06800  
 Jay     : 5   dispt_fed_52.txt: 1   Mean   :0.2932   Mean   :0.05284   Mean   :0.007659   Mean   :0.06839   Mean   :0.3846   Mean   :0.04161   Mean   :0.07707  
 Madison :15   dispt_fed_53.txt: 1   3rd Qu.:0.3490   3rd Qu.:0.06600   3rd Qu.:0.013000   3rd Qu.:0.08500   3rd Qu.:0.4130   3rd Qu.:0.05600   3rd Qu.:0.10200  
               dispt_fed_54.txt: 1   Max.   :0.4660   Max.   :0.12700   Max.   :0.047000   Max.   :0.17900   Max.   :0.8210   Max.   :0.11400   Max.   :0.16300  
               (Other)         :79                                                                                                                               
       as               at                be              been              but                by              can                do                down         
 Min.   :0.0270   Min.   :0.00000   Min.   :0.0400   Min.   :0.00000   Min.   :0.00000   Min.   :0.0270   Min.   :0.00000   Min.   :0.000000   Min.   :0.000000  
 1st Qu.:0.1000   1st Qu.:0.02600   1st Qu.:0.2580   1st Qu.:0.03000   1st Qu.:0.02200   1st Qu.:0.0920   1st Qu.:0.01400   1st Qu.:0.000000   1st Qu.:0.000000  
 Median :0.1240   Median :0.03800   Median :0.3070   Median :0.05300   Median :0.03200   Median :0.1240   Median :0.02900   Median :0.006000   Median :0.000000  
 Mean   :0.1242   Mean   :0.04427   Mean   :0.3012   Mean   :0.05967   Mean   :0.03232   Mean   :0.1272   Mean   :0.03558   Mean   :0.006259   Mean   :0.001529  
 3rd Qu.:0.1440   3rd Qu.:0.06300   3rd Qu.:0.3580   3rd Qu.:0.08400   3rd Qu.:0.04200   3rd Qu.:0.1620   3rd Qu.:0.05200   3rd Qu.:0.010000   3rd Qu.:0.000000  
 Max.   :0.2520   Max.   :0.11800   Max.   :0.4810   Max.   :0.16500   Max.   :0.08900   Max.   :0.2640   Max.   :0.11000   Max.   :0.028000   Max.   :0.017000  
                                                                                                                                                                 
      even            every              for.              from              had               has               have              her          
 Min.   :0.0000   Min.   :0.00000   Min.   :0.03000   Min.   :0.02600   Min.   :0.00000   Min.   :0.00000   Min.   :0.01100   Min.   :0.000000  
 1st Qu.:0.0000   1st Qu.:0.00900   1st Qu.:0.07000   1st Qu.:0.05700   1st Qu.:0.00800   1st Qu.:0.02500   1st Qu.:0.07300   1st Qu.:0.000000  
 Median :0.0100   Median :0.02200   Median :0.08800   Median :0.07800   Median :0.01600   Median :0.04600   Median :0.09000   Median :0.000000  
 Mean   :0.0114   Mean   :0.02391   Mean   :0.09376   Mean   :0.07978   Mean   :0.02116   Mean   :0.04442   Mean   :0.09474   Mean   :0.008094  
 3rd Qu.:0.0180   3rd Qu.:0.03400   3rd Qu.:0.11400   3rd Qu.:0.09800   3rd Qu.:0.02700   3rd Qu.:0.05700   3rd Qu.:0.12400   3rd Qu.:0.007000  
 Max.   :0.0370   Max.   :0.08700   Max.   :0.21300   Max.   :0.16200   Max.   :0.14100   Max.   :0.11400   Max.   :0.18500   Max.   :0.150000  
                                                                                                                                                
      his               if.               in.              into               is               it              its               may               more        
 Min.   :0.00000   Min.   :0.00000   Min.   :0.1890   Min.   :0.00000   Min.   :0.0280   Min.   :0.0750   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.01600   1st Qu.:0.2670   1st Qu.:0.01000   1st Qu.:0.1180   1st Qu.:0.1290   1st Qu.:0.03000   1st Qu.:0.03600   1st Qu.:0.02300  
 Median :0.01400   Median :0.02600   Median :0.3040   Median :0.02200   Median :0.1510   Median :0.1510   Median :0.04200   Median :0.05600   Median :0.04400  
 Mean   :0.02862   Mean   :0.02733   Mean   :0.3174   Mean   :0.02409   Mean   :0.1563   Mean   :0.1567   Mean   :0.04836   Mean   :0.06181   Mean   :0.04561  
 3rd Qu.:0.03900   3rd Qu.:0.03400   3rd Qu.:0.3550   3rd Qu.:0.03400   3rd Qu.:0.1960   3rd Qu.:0.1900   3rd Qu.:0.06400   3rd Qu.:0.08500   3rd Qu.:0.06100  
 Max.   :0.24700   Max.   :0.09900   Max.   :0.4990   Max.   :0.10500   Max.   :0.3230   Max.   :0.2840   Max.   :0.15000   Max.   :0.13400   Max.   :0.13000  
                                                                                                                                                               
      must               my                 no               not               now                 of               on               one         
 Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.02000   Min.   :0.000000   Min.   :0.5620   Min.   :0.00000   Min.   :0.00500  
 1st Qu.:0.01400   1st Qu.:0.000000   1st Qu.:0.02000   1st Qu.:0.07500   1st Qu.:0.000000   1st Qu.:0.8560   1st Qu.:0.04300   1st Qu.:0.02700  
 Median :0.02700   Median :0.000000   Median :0.02900   Median :0.09500   Median :0.005000   Median :0.9020   Median :0.06200   Median :0.03600  
 Mean   :0.03305   Mean   :0.003259   Mean   :0.03236   Mean   :0.09248   Mean   :0.006035   Mean   :0.9094   Mean   :0.06926   Mean   :0.04079  
 3rd Qu.:0.04400   3rd Qu.:0.005000   3rd Qu.:0.04300   3rd Qu.:0.11200   3rd Qu.:0.010000   3rd Qu.:0.9690   3rd Qu.:0.09700   3rd Qu.:0.05000  
 Max.   :0.11100   Max.   :0.056000   Max.   :0.08300   Max.   :0.14800   Max.   :0.026000   Max.   :1.2110   Max.   :0.15600   Max.   :0.13500  
                                                                                                                                                 
      only               or               our            shall             should              so               some              such              than        
 Min.   :0.00000   Min.   :0.02700   Min.   :0.000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.01000   1st Qu.:0.07000   1st Qu.:0.000   1st Qu.:0.00600   1st Qu.:0.01000   1st Qu.:0.01800   1st Qu.:0.00900   1st Qu.:0.01800   1st Qu.:0.02700  
 Median :0.02200   Median :0.08100   Median :0.013   Median :0.01400   Median :0.02700   Median :0.02900   Median :0.01700   Median :0.02900   Median :0.04300  
 Mean   :0.02288   Mean   :0.09674   Mean   :0.023   Mean   :0.01875   Mean   :0.02656   Mean   :0.02982   Mean   :0.01989   Mean   :0.02922   Mean   :0.04396  
 3rd Qu.:0.03400   3rd Qu.:0.11600   3rd Qu.:0.028   3rd Qu.:0.02700   3rd Qu.:0.03800   3rd Qu.:0.04000   3rd Qu.:0.02800   3rd Qu.:0.03800   3rd Qu.:0.05500  
 Max.   :0.06500   Max.   :0.32100   Max.   :0.199   Max.   :0.07900   Max.   :0.09100   Max.   :0.07200   Max.   :0.06700   Max.   :0.08500   Max.   :0.15000  
                                                                                                                                                                
      that            the            their              then              there             things              this               to               up          
 Min.   :0.081   Min.   :0.669   Min.   :0.00500   Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00900   Min.   :0.3330   Min.   :0.000000  
 1st Qu.:0.171   1st Qu.:1.178   1st Qu.:0.05500   1st Qu.:0.000000   1st Qu.:0.00900   1st Qu.:0.000000   1st Qu.:0.06900   1st Qu.:0.4690   1st Qu.:0.000000  
 Median :0.208   Median :1.275   Median :0.08600   Median :0.006000   Median :0.02200   Median :0.000000   Median :0.09000   Median :0.5400   Median :0.000000  
 Mean   :0.212   Mean   :1.281   Mean   :0.08553   Mean   :0.006082   Mean   :0.02638   Mean   :0.002659   Mean   :0.08701   Mean   :0.5358   Mean   :0.003482  
 3rd Qu.:0.244   3rd Qu.:1.423   3rd Qu.:0.10600   3rd Qu.:0.010000   3rd Qu.:0.03900   3rd Qu.:0.006000   3rd Qu.:0.10500   3rd Qu.:0.6060   3rd Qu.:0.006000  
 Max.   :0.380   Max.   :1.803   Max.   :0.18300   Max.   :0.021000   Max.   :0.10500   Max.   :0.015000   Max.   :0.15300   Max.   :0.7760   Max.   :0.032000  
                                                                                                                                                                
      upon              was               were              what              when             which             who               will              with        
 Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0810   Min.   :0.00000   Min.   :0.00600   Min.   :0.02700  
 1st Qu.:0.00000   1st Qu.:0.00900   1st Qu.:0.00700   1st Qu.:0.00500   1st Qu.:0.00000   1st Qu.:0.1180   1st Qu.:0.01600   1st Qu.:0.05200   1st Qu.:0.06100  
 Median :0.02800   Median :0.01500   Median :0.01500   Median :0.01000   Median :0.00900   Median :0.1520   Median :0.02700   Median :0.08100   Median :0.07900  
 Mean   :0.02922   Mean   :0.02584   Mean   :0.02022   Mean   :0.01286   Mean   :0.01174   Mean   :0.1578   Mean   :0.03253   Mean   :0.09865   Mean   :0.07968  
 3rd Qu.:0.05000   3rd Qu.:0.03200   3rd Qu.:0.02900   3rd Qu.:0.02000   3rd Qu.:0.01500   3rd Qu.:0.1830   3rd Qu.:0.04400   3rd Qu.:0.13500   3rd Qu.:0.09200  
 Max.   :0.10200   Max.   :0.18900   Max.   :0.10800   Max.   :0.06000   Max.   :0.07300   Max.   :0.2760   Max.   :0.12900   Max.   :0.34000   Max.   :0.15000  
                                                                                                                                                                 
     would             your         
 Min.   :0.0090   Min.   :0.000000  
 1st Qu.:0.0420   1st Qu.:0.000000  
 Median :0.0780   Median :0.000000  
 Mean   :0.1017   Mean   :0.002024  
 3rd Qu.:0.1470   3rd Qu.:0.000000  
 Max.   :0.3820   Max.   :0.074000  
                                    
# Preview the first few rows
  head(fedPapersDF)
# compare the number of articles by author
  x <- data.frame(table(fedPapersDF$author))
  coul <- brewer.pal(5, "Set2")
  barplot(height = x$Freq, names = x$Var1, col = coul, xlab = "Authors",
        ylab = "Number of Papers",
        main = "FedPapers85 by Authors",
        ylim = c(0, 60))

# view the data
  View(fedPapersDF)
# Data preparation
  
  # 1. Training Set Preparation

  # Remove filename: it is a unique identifier, not a predictor
  fedPapersDF1 <- subset(fedPapersDF,select=-filename)
  fedPapersDF1
  
  set.seed(100)  

  # separate the disputed articles
  fedPapersDF_Dispt <- subset(fedPapersDF1, author=='dispt')
  # fedPapersDF_Dispt
  # keep the non-disputed articles for training and testing
  fedPapersDF_authors <- subset(fedPapersDF1, author!='dispt')
  fedPapersDF_authors

  # Split the non-disputed articles into training and test sets.
  # The training set is 70% of the rows; other fractions were also tried
  # (test accuracy: 65% -> 80%, 70% -> 82%, 75% -> 78%, 80% -> 70%)

  sample_size = floor(0.70*nrow(fedPapersDF_authors)) # 51 of 74 rows

  # set.seed(100) above ensures the same random sample is drawn each run
  # (seed 324 yielded 100% training accuracy)
  train_index = sample(seq_len(nrow(fedPapersDF_authors)),size = sample_size)

  train_data =fedPapersDF_authors[train_index,] # rows selected by train_index form the training set
  test_data=fedPapersDF_authors[-train_index,]  # the remaining rows form the test set

  cat("\nArticles by Author:")

Articles by Author:
  table(fedPapersDF_authors$author)

   dispt Hamilton       HM      Jay  Madison 
       0       51        3        5       15 
  cat("\nTrain_data - Articles by Author:")

Train_data - Articles by Author:
  table(train_data$author)

   dispt Hamilton       HM      Jay  Madison 
       0       36        2        3       10 
  cat("\nTest_data - Articles by Author:")

Test_data - Articles by Author:
  table(test_data$author)

   dispt Hamilton       HM      Jay  Madison 
       0       15        1        2        5 
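Because the author classes are imbalanced (Hamilton dominates, while HM and Jay have only a few papers each), a uniform random split can by chance under-represent a small class in either set. A minimal alternative sketch, assuming the caret package loaded above and the fedPapersDF_authors data frame; train_strat and test_strat are illustrative names:

```r
# stratified 70/30 split: sample within each author class
set.seed(100)
idx <- createDataPartition(fedPapersDF_authors$author, p = 0.70, list = FALSE)
train_strat <- fedPapersDF_authors[idx, ]
test_strat  <- fedPapersDF_authors[-idx, ]
# class proportions should now be similar in both sets
table(train_strat$author)
table(test_strat$author)
```

Stratification matters most for Madison here, since the disputed-paper question hinges on the model seeing enough Madison examples during training.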
# Section 2: Build and tune decision tree models
  
  # grow a tree with the default rpart settings
  rtree <- rpart(author~. ,data=train_data, method='class')

  # summarize the fitted tree
  summary(rtree)
Call:
rpart(formula = author ~ ., data = train_data, method = "class")
  n= 51 

    CP nsplit rel error xerror      xstd
1 0.60      0       1.0    1.0 0.2169305
2 0.01      1       0.4    0.4 0.1533930

Variable importance
 upon there    on    to    an   and 
   26    21    16    16    10    10 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  plotcp(rtree) # plot cross-validation results

  printcp(rtree) # print the cross-validation (cp) table

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class")

Variables actually used in tree construction:
[1] upon

Root node error: 15/51 = 0.29412

n= 51 

    CP nsplit rel error xerror    xstd
1 0.60      0       1.0    1.0 0.21693
2 0.01      1       0.4    0.4 0.15339
  # plot the decision tree
  rpart.plot(rtree,main="Classification Tree for fedPapers85", extra= 102)

  rsq.rpart(rtree) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class")

Variables actually used in tree construction:
[1] upon

Root node error: 15/51 = 0.29412

n= 51 

    CP nsplit rel error xerror    xstd
1 0.60      0       1.0    1.0 0.21693
2 0.01      1       0.4    0.4 0.15339
may not be applicable for this method
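Section 2 also asks for appropriate evaluation measures. A minimal sketch, assuming the rtree model, train/test objects, and caret package from above, scores the default tree on the held-out test set:

```r
# predict author labels for the held-out test set
rtree_pred <- predict(rtree, newdata = test_data, type = "class")
# confusion matrix with overall accuracy and per-class statistics
confusionMatrix(rtree_pred, test_data$author)
```

Comparing this held-out accuracy across the default and tuned trees is a fairer basis for model selection than training accuracy, which the deeper trees can drive to 100% by memorizing the sample.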


  # grow a deeper tree: cp = 0, minsplit = 0, maxdepth = 5
  rtree_0 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 0, maxdepth = 5)

  # summarize the fitted tree
  summary(rtree_0)
Call:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 0, maxdepth = 5)
  n= 51 

          CP nsplit  rel error    xerror      xstd
1 0.60000000      0 1.00000000 1.0000000 0.2169305
2 0.20000000      1 0.40000000 0.4666667 0.1638321
3 0.13333333      2 0.20000000 0.5333333 0.1731422
4 0.06666667      3 0.06666667 0.5333333 0.1731422
5 0.00000000      4 0.00000000 0.4666667 0.1638321

Variable importance
 upon    an there   and    on    to    no which  been every   if.    it   not    of     a 
   15    13    12    12    10    10     5     5     4     4     2     2     2     2     1 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations,    complexity param=0.2
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  left son=6 (6 obs) right son=7 (10 obs)
  Primary splits:
      no    < 0.021  to the left,  improve=5.208333, (0 missing)
      an    < 0.046  to the left,  improve=4.656818, (0 missing)
      which < 0.113  to the left,  improve=4.256818, (0 missing)
      of    < 0.7305 to the left,  improve=3.951923, (0 missing)
      the   < 1.019  to the left,  improve=3.951923, (0 missing)
  Surrogate splits:
      an    < 0.046  to the left,  agree=0.938, adj=0.833, (0 split)
      which < 0.113  to the left,  agree=0.938, adj=0.833, (0 split)
      and   < 0.467  to the right, agree=0.875, adj=0.667, (0 split)
      been  < 0.0275 to the left,  agree=0.875, adj=0.667, (0 split)
      every < 0.0125 to the left,  agree=0.875, adj=0.667, (0 split)

Node number 6: 6 observations,    complexity param=0.1333333
  predicted class=Jay       expected loss=0.5  P(node) =0.1176471
    class counts:     0     1     2     3     0
   probabilities: 0.000 0.167 0.333 0.500 0.000 
  left son=12 (3 obs) right son=13 (3 obs)
  Primary splits:
      an  < 0.032  to the right, improve=2.333333, (0 missing)
      and < 0.5665 to the left,  improve=2.333333, (0 missing)
      if. < 0.0175 to the left,  improve=2.333333, (0 missing)
      it  < 0.17   to the left,  improve=2.333333, (0 missing)
      not < 0.0695 to the left,  improve=2.333333, (0 missing)
  Surrogate splits:
      and < 0.5665 to the left,  agree=1, adj=1, (0 split)
      if. < 0.0175 to the left,  agree=1, adj=1, (0 split)
      it  < 0.17   to the left,  agree=1, adj=1, (0 split)
      not < 0.0695 to the left,  agree=1, adj=1, (0 split)
      of  < 0.7305 to the right, agree=1, adj=1, (0 split)

Node number 7: 10 observations
  predicted class=Madison   expected loss=0  P(node) =0.1960784
    class counts:     0     0     0     0    10
   probabilities: 0.000 0.000 0.000 0.000 1.000 

Node number 12: 3 observations,    complexity param=0.06666667
  predicted class=HM        expected loss=0.3333333  P(node) =0.05882353
    class counts:     0     1     2     0     0
   probabilities: 0.000 0.333 0.667 0.000 0.000 
  left son=24 (1 obs) right son=25 (2 obs)
  Primary splits:
      a   < 0.279  to the right, improve=1.333333, (0 missing)
      all < 0.0385 to the left,  improve=1.333333, (0 missing)
      an  < 0.0485 to the right, improve=1.333333, (0 missing)
      and < 0.4115 to the left,  improve=1.333333, (0 missing)
      any < 0.0215 to the right, improve=1.333333, (0 missing)

Node number 13: 3 observations
  predicted class=Jay       expected loss=0  P(node) =0.05882353
    class counts:     0     0     0     3     0
   probabilities: 0.000 0.000 0.000 1.000 0.000 

Node number 24: 1 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.01960784
    class counts:     0     1     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 25: 2 observations
  predicted class=HM        expected loss=0  P(node) =0.03921569
    class counts:     0     0     2     0     0
   probabilities: 0.000 0.000 1.000 0.000 0.000 
  plotcp(rtree_0) # plot cross-validation results

  printcp(rtree_0) # print the cross-validation (cp) table

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 0, maxdepth = 5)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.46667 0.16383
3 0.133333      2  0.200000 0.53333 0.17314
4 0.066667      3  0.066667 0.53333 0.17314
5 0.000000      4  0.000000 0.46667 0.16383
  # plot the decision tree
  rpart.plot(rtree_0,main="Classification Tree for fedPapers85", extra= 102)

  rsq.rpart(rtree_0) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 0, maxdepth = 5)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.46667 0.16383
3 0.133333      2  0.200000 0.53333 0.17314
4 0.066667      3  0.066667 0.53333 0.17314
5 0.000000      4  0.000000 0.46667 0.16383
may not be applicable for this method

NA
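Rather than hand-picking cp, minsplit, and maxdepth, the complexity parameter can also be tuned by cross-validation. The sketch below, assuming the train_data object and caret package from above, uses caret::train; droplevels removes the empty 'dispt' class, which train() would otherwise reject, and the cp grid is an illustrative choice:

```r
# 5-fold cross-validated tuning of the rpart complexity parameter
set.seed(100)
cv_fit <- train(
  author ~ ., data = droplevels(train_data),
  method    = "rpart",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid  = data.frame(cp = seq(0, 0.6, by = 0.05))
)
cv_fit$bestTune  # cp value with the best cross-validated accuracy
```

plot(cv_fit) then shows accuracy as a function of cp, which complements the plotcp() diagnostics above.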
  # grow tree with cp = 0, minsplit = 1, maxdepth = 5
  rtree_1 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 1, maxdepth = 5)

  # summarize the fitted tree
  summary(rtree_1)
Call:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 1, maxdepth = 5)
  n= 51 

          CP nsplit  rel error    xerror      xstd
1 0.60000000      0 1.00000000 1.0000000 0.2169305
2 0.20000000      1 0.40000000 0.4000000 0.1533930
3 0.13333333      2 0.20000000 0.4666667 0.1638321
4 0.06666667      3 0.06666667 0.4666667 0.1638321
5 0.00000000      4 0.00000000 0.4000000 0.1533930

Variable importance
 upon    an there   and    on    to    no which  been every   if.    it   not    of     a 
   15    13    12    12    10    10     5     5     4     4     2     2     2     2     1 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations,    complexity param=0.2
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  left son=6 (6 obs) right son=7 (10 obs)
  Primary splits:
      no    < 0.021  to the left,  improve=5.208333, (0 missing)
      an    < 0.046  to the left,  improve=4.656818, (0 missing)
      which < 0.113  to the left,  improve=4.256818, (0 missing)
      of    < 0.7305 to the left,  improve=3.951923, (0 missing)
      the   < 1.019  to the left,  improve=3.951923, (0 missing)
  Surrogate splits:
      an    < 0.046  to the left,  agree=0.938, adj=0.833, (0 split)
      which < 0.113  to the left,  agree=0.938, adj=0.833, (0 split)
      and   < 0.467  to the right, agree=0.875, adj=0.667, (0 split)
      been  < 0.0275 to the left,  agree=0.875, adj=0.667, (0 split)
      every < 0.0125 to the left,  agree=0.875, adj=0.667, (0 split)

Node number 6: 6 observations,    complexity param=0.1333333
  predicted class=Jay       expected loss=0.5  P(node) =0.1176471
    class counts:     0     1     2     3     0
   probabilities: 0.000 0.167 0.333 0.500 0.000 
  left son=12 (3 obs) right son=13 (3 obs)
  Primary splits:
      an  < 0.032  to the right, improve=2.333333, (0 missing)
      and < 0.5665 to the left,  improve=2.333333, (0 missing)
      if. < 0.0175 to the left,  improve=2.333333, (0 missing)
      it  < 0.17   to the left,  improve=2.333333, (0 missing)
      not < 0.0695 to the left,  improve=2.333333, (0 missing)
  Surrogate splits:
      and < 0.5665 to the left,  agree=1, adj=1, (0 split)
      if. < 0.0175 to the left,  agree=1, adj=1, (0 split)
      it  < 0.17   to the left,  agree=1, adj=1, (0 split)
      not < 0.0695 to the left,  agree=1, adj=1, (0 split)
      of  < 0.7305 to the right, agree=1, adj=1, (0 split)

Node number 7: 10 observations
  predicted class=Madison   expected loss=0  P(node) =0.1960784
    class counts:     0     0     0     0    10
   probabilities: 0.000 0.000 0.000 0.000 1.000 

Node number 12: 3 observations,    complexity param=0.06666667
  predicted class=HM        expected loss=0.3333333  P(node) =0.05882353
    class counts:     0     1     2     0     0
   probabilities: 0.000 0.333 0.667 0.000 0.000 
  left son=24 (1 obs) right son=25 (2 obs)
  Primary splits:
      a   < 0.279  to the right, improve=1.333333, (0 missing)
      all < 0.0385 to the left,  improve=1.333333, (0 missing)
      an  < 0.0485 to the right, improve=1.333333, (0 missing)
      and < 0.4115 to the left,  improve=1.333333, (0 missing)
      any < 0.0215 to the right, improve=1.333333, (0 missing)

Node number 13: 3 observations
  predicted class=Jay       expected loss=0  P(node) =0.05882353
    class counts:     0     0     0     3     0
   probabilities: 0.000 0.000 0.000 1.000 0.000 

Node number 24: 1 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.01960784
    class counts:     0     1     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 25: 2 observations
  predicted class=HM        expected loss=0  P(node) =0.03921569
    class counts:     0     0     2     0     0
   probabilities: 0.000 0.000 1.000 0.000 0.000 
  plotcp(rtree_1) # plot cross-validation results

  printcp(rtree_1) # print the cross-validation results table

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 1, maxdepth = 5)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.46667 0.16383
4 0.066667      3  0.066667 0.46667 0.16383
5 0.000000      4  0.000000 0.40000 0.15339
  # plot the decision tree
  rpart.plot(rtree_1, main = "Classification Tree for fedPapers85", extra = 102)

  rsq.rpart(rtree_1) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 1, maxdepth = 5)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.46667 0.16383
4 0.066667      3  0.066667 0.46667 0.16383
5 0.000000      4  0.000000 0.40000 0.15339
may not be applicable for this method
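The cp table above shows the cross-validated error (`xerror`) bottoming out below the full tree's error, which suggests pruning. A minimal sketch, assuming the `rtree_1` fit from this chunk is still in scope, of pruning back to the cp value with the lowest `xerror`:

```r
  # select the cp value with the lowest cross-validated error (xerror)
  best_cp <- rtree_1$cptable[which.min(rtree_1$cptable[, "xerror"]), "CP"]

  # prune the overgrown tree back to that complexity and replot it
  rtree_1_pruned <- prune(rtree_1, cp = best_cp)
  rpart.plot(rtree_1_pruned, main = "Pruned Classification Tree", extra = 102)
```

`prune()` keeps only the splits whose complexity parameter exceeds `best_cp`, trading a little training accuracy for better generalization.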

  # grow tree with cp = 0, minsplit = 2, maxdepth = 10
  rtree_2 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 2, maxdepth = 10)

  # summarize the fitted tree
  summary(rtree_2)
Call:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 2, maxdepth = 10)
  n= 51 

          CP nsplit  rel error    xerror      xstd
1 0.60000000      0 1.00000000 1.0000000 0.2169305
2 0.20000000      1 0.40000000 0.4000000 0.1533930
3 0.13333333      2 0.20000000 0.4666667 0.1638321
4 0.06666667      3 0.06666667 0.4666667 0.1638321
5 0.00000000      4 0.00000000 0.4666667 0.1638321

Variable importance
 upon    an there   and    on    to    no which  been every   if.    it   not    of     a 
   15    13    12    12    10    10     5     5     4     4     2     2     2     2     1 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations,    complexity param=0.2
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  left son=6 (6 obs) right son=7 (10 obs)
  Primary splits:
      no    < 0.021  to the left,  improve=5.208333, (0 missing)
      an    < 0.046  to the left,  improve=4.656818, (0 missing)
      which < 0.113  to the left,  improve=4.256818, (0 missing)
      of    < 0.7305 to the left,  improve=3.951923, (0 missing)
      the   < 1.019  to the left,  improve=3.951923, (0 missing)
  Surrogate splits:
      an    < 0.046  to the left,  agree=0.938, adj=0.833, (0 split)
      which < 0.113  to the left,  agree=0.938, adj=0.833, (0 split)
      and   < 0.467  to the right, agree=0.875, adj=0.667, (0 split)
      been  < 0.0275 to the left,  agree=0.875, adj=0.667, (0 split)
      every < 0.0125 to the left,  agree=0.875, adj=0.667, (0 split)

Node number 6: 6 observations,    complexity param=0.1333333
  predicted class=Jay       expected loss=0.5  P(node) =0.1176471
    class counts:     0     1     2     3     0
   probabilities: 0.000 0.167 0.333 0.500 0.000 
  left son=12 (3 obs) right son=13 (3 obs)
  Primary splits:
      an  < 0.032  to the right, improve=2.333333, (0 missing)
      and < 0.5665 to the left,  improve=2.333333, (0 missing)
      if. < 0.0175 to the left,  improve=2.333333, (0 missing)
      it  < 0.17   to the left,  improve=2.333333, (0 missing)
      not < 0.0695 to the left,  improve=2.333333, (0 missing)
  Surrogate splits:
      and < 0.5665 to the left,  agree=1, adj=1, (0 split)
      if. < 0.0175 to the left,  agree=1, adj=1, (0 split)
      it  < 0.17   to the left,  agree=1, adj=1, (0 split)
      not < 0.0695 to the left,  agree=1, adj=1, (0 split)
      of  < 0.7305 to the right, agree=1, adj=1, (0 split)

Node number 7: 10 observations
  predicted class=Madison   expected loss=0  P(node) =0.1960784
    class counts:     0     0     0     0    10
   probabilities: 0.000 0.000 0.000 0.000 1.000 

Node number 12: 3 observations,    complexity param=0.06666667
  predicted class=HM        expected loss=0.3333333  P(node) =0.05882353
    class counts:     0     1     2     0     0
   probabilities: 0.000 0.333 0.667 0.000 0.000 
  left son=24 (1 obs) right son=25 (2 obs)
  Primary splits:
      a   < 0.279  to the right, improve=1.333333, (0 missing)
      all < 0.0385 to the left,  improve=1.333333, (0 missing)
      an  < 0.0485 to the right, improve=1.333333, (0 missing)
      and < 0.4115 to the left,  improve=1.333333, (0 missing)
      any < 0.0215 to the right, improve=1.333333, (0 missing)

Node number 13: 3 observations
  predicted class=Jay       expected loss=0  P(node) =0.05882353
    class counts:     0     0     0     3     0
   probabilities: 0.000 0.000 0.000 1.000 0.000 

Node number 24: 1 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.01960784
    class counts:     0     1     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 25: 2 observations
  predicted class=HM        expected loss=0  P(node) =0.03921569
    class counts:     0     0     2     0     0
   probabilities: 0.000 0.000 1.000 0.000 0.000 
  plotcp(rtree_2) # plot cross-validation results

  printcp(rtree_2) # print the cross-validation results table

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 2, maxdepth = 10)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.46667 0.16383
4 0.066667      3  0.066667 0.46667 0.16383
5 0.000000      4  0.000000 0.46667 0.16383
  # plot the decision tree
  rpart.plot(rtree_2, main = "Classification Tree for fedPapers85", extra = 102)

  rsq.rpart(rtree_2) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 2, maxdepth = 10)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.46667 0.16383
4 0.066667      3  0.066667 0.46667 0.16383
5 0.000000      4  0.000000 0.46667 0.16383
may not be applicable for this method

  # grow tree with cp = 0, minsplit = 3, maxdepth = 5
  rtree_3 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 3, maxdepth = 5)

  # summarize the fitted tree
  summary(rtree_3)
Call:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 3, maxdepth = 5)
  n= 51 

          CP nsplit  rel error    xerror      xstd
1 0.60000000      0 1.00000000 1.0000000 0.2169305
2 0.20000000      1 0.40000000 0.4000000 0.1533930
3 0.13333333      2 0.20000000 0.4000000 0.1533930
4 0.06666667      3 0.06666667 0.4000000 0.1533930
5 0.00000000      4 0.00000000 0.3333333 0.1415753

Variable importance
 upon    an there   and    on    to    no which  been every   if.    it   not    of     a 
   15    13    12    12    10    10     5     5     4     4     2     2     2     2     1 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations,    complexity param=0.2
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  left son=6 (6 obs) right son=7 (10 obs)
  Primary splits:
      no    < 0.021  to the left,  improve=5.208333, (0 missing)
      an    < 0.046  to the left,  improve=4.656818, (0 missing)
      which < 0.113  to the left,  improve=4.256818, (0 missing)
      of    < 0.7305 to the left,  improve=3.951923, (0 missing)
      the   < 1.019  to the left,  improve=3.951923, (0 missing)
  Surrogate splits:
      an    < 0.046  to the left,  agree=0.938, adj=0.833, (0 split)
      which < 0.113  to the left,  agree=0.938, adj=0.833, (0 split)
      and   < 0.467  to the right, agree=0.875, adj=0.667, (0 split)
      been  < 0.0275 to the left,  agree=0.875, adj=0.667, (0 split)
      every < 0.0125 to the left,  agree=0.875, adj=0.667, (0 split)

Node number 6: 6 observations,    complexity param=0.1333333
  predicted class=Jay       expected loss=0.5  P(node) =0.1176471
    class counts:     0     1     2     3     0
   probabilities: 0.000 0.167 0.333 0.500 0.000 
  left son=12 (3 obs) right son=13 (3 obs)
  Primary splits:
      an  < 0.032  to the right, improve=2.333333, (0 missing)
      and < 0.5665 to the left,  improve=2.333333, (0 missing)
      if. < 0.0175 to the left,  improve=2.333333, (0 missing)
      it  < 0.17   to the left,  improve=2.333333, (0 missing)
      not < 0.0695 to the left,  improve=2.333333, (0 missing)
  Surrogate splits:
      and < 0.5665 to the left,  agree=1, adj=1, (0 split)
      if. < 0.0175 to the left,  agree=1, adj=1, (0 split)
      it  < 0.17   to the left,  agree=1, adj=1, (0 split)
      not < 0.0695 to the left,  agree=1, adj=1, (0 split)
      of  < 0.7305 to the right, agree=1, adj=1, (0 split)

Node number 7: 10 observations
  predicted class=Madison   expected loss=0  P(node) =0.1960784
    class counts:     0     0     0     0    10
   probabilities: 0.000 0.000 0.000 0.000 1.000 

Node number 12: 3 observations,    complexity param=0.06666667
  predicted class=HM        expected loss=0.3333333  P(node) =0.05882353
    class counts:     0     1     2     0     0
   probabilities: 0.000 0.333 0.667 0.000 0.000 
  left son=24 (1 obs) right son=25 (2 obs)
  Primary splits:
      a   < 0.279  to the right, improve=1.333333, (0 missing)
      all < 0.0385 to the left,  improve=1.333333, (0 missing)
      an  < 0.0485 to the right, improve=1.333333, (0 missing)
      and < 0.4115 to the left,  improve=1.333333, (0 missing)
      any < 0.0215 to the right, improve=1.333333, (0 missing)

Node number 13: 3 observations
  predicted class=Jay       expected loss=0  P(node) =0.05882353
    class counts:     0     0     0     3     0
   probabilities: 0.000 0.000 0.000 1.000 0.000 

Node number 24: 1 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.01960784
    class counts:     0     1     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 25: 2 observations
  predicted class=HM        expected loss=0  P(node) =0.03921569
    class counts:     0     0     2     0     0
   probabilities: 0.000 0.000 1.000 0.000 0.000 
  plotcp(rtree_3) # plot cross-validation results

  printcp(rtree_3) # print the cross-validation results table

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 3, maxdepth = 5)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.40000 0.15339
4 0.066667      3  0.066667 0.40000 0.15339
5 0.000000      4  0.000000 0.33333 0.14158
  # plot the decision tree
  rpart.plot(rtree_3, main = "Classification Tree for fedPapers85", extra = 102)

  rsq.rpart(rtree_3) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 3, maxdepth = 5)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.40000 0.15339
4 0.066667      3  0.066667 0.40000 0.15339
5 0.000000      4  0.000000 0.33333 0.14158
may not be applicable for this method
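Of the trees grown so far, `rtree_3` has the lowest cross-validated error (xerror = 0.33333), making it the natural candidate for the Section 3 prediction step. A hedged sketch of applying it to the disputed essays; `disputed_data` is an assumed name for the rows of `fedPapers85` labelled `dispt`, held out from training in Section 1:

```r
  # predict authorship of the disputed essays with the best-tuned tree
  # (disputed_data is assumed to have the same predictor columns as train_data)
  pred_disputed <- predict(rtree_3, newdata = disputed_data, type = "class")
  table(pred_disputed) # tally the predicted authors
```

The resulting tally can be compared directly against last week's clustering assignments for the same eleven papers.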

  # grow tree with cp = 0, minsplit = 4, maxdepth = 5
  rtree_4 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 4, maxdepth = 5)

  # summarize the fitted tree
  summary(rtree_4)
Call:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 4, maxdepth = 5)
  n= 51 

         CP nsplit  rel error    xerror      xstd
1 0.6000000      0 1.00000000 1.0000000 0.2169305
2 0.2000000      1 0.40000000 0.4000000 0.1533930
3 0.1333333      2 0.20000000 0.4666667 0.1638321
4 0.0000000      3 0.06666667 0.4000000 0.1533930

Variable importance
 upon    an there   and    on    to    no which  been every   if.    it   not    of 
   15    13    13    12    10    10     6     5     4     4     2     2     2     2 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations,    complexity param=0.2
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  left son=6 (6 obs) right son=7 (10 obs)
  Primary splits:
      no    < 0.021  to the left,  improve=5.208333, (0 missing)
      an    < 0.046  to the left,  improve=4.656818, (0 missing)
      which < 0.113  to the left,  improve=4.256818, (0 missing)
      of    < 0.7305 to the left,  improve=3.951923, (0 missing)
      the   < 1.019  to the left,  improve=3.951923, (0 missing)
  Surrogate splits:
      an    < 0.046  to the left,  agree=0.938, adj=0.833, (0 split)
      which < 0.113  to the left,  agree=0.938, adj=0.833, (0 split)
      and   < 0.467  to the right, agree=0.875, adj=0.667, (0 split)
      been  < 0.0275 to the left,  agree=0.875, adj=0.667, (0 split)
      every < 0.0125 to the left,  agree=0.875, adj=0.667, (0 split)

Node number 6: 6 observations,    complexity param=0.1333333
  predicted class=Jay       expected loss=0.5  P(node) =0.1176471
    class counts:     0     1     2     3     0
   probabilities: 0.000 0.167 0.333 0.500 0.000 
  left son=12 (3 obs) right son=13 (3 obs)
  Primary splits:
      an  < 0.032  to the right, improve=2.333333, (0 missing)
      and < 0.5665 to the left,  improve=2.333333, (0 missing)
      if. < 0.0175 to the left,  improve=2.333333, (0 missing)
      it  < 0.17   to the left,  improve=2.333333, (0 missing)
      not < 0.0695 to the left,  improve=2.333333, (0 missing)
  Surrogate splits:
      and < 0.5665 to the left,  agree=1, adj=1, (0 split)
      if. < 0.0175 to the left,  agree=1, adj=1, (0 split)
      it  < 0.17   to the left,  agree=1, adj=1, (0 split)
      not < 0.0695 to the left,  agree=1, adj=1, (0 split)
      of  < 0.7305 to the right, agree=1, adj=1, (0 split)

Node number 7: 10 observations
  predicted class=Madison   expected loss=0  P(node) =0.1960784
    class counts:     0     0     0     0    10
   probabilities: 0.000 0.000 0.000 0.000 1.000 

Node number 12: 3 observations
  predicted class=HM        expected loss=0.3333333  P(node) =0.05882353
    class counts:     0     1     2     0     0
   probabilities: 0.000 0.333 0.667 0.000 0.000 

Node number 13: 3 observations
  predicted class=Jay       expected loss=0  P(node) =0.05882353
    class counts:     0     0     0     3     0
   probabilities: 0.000 0.000 0.000 1.000 0.000 
  plotcp(rtree_4) # plot cross-validation results

  printcp(rtree_4) # print the cross-validation results table

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 4, maxdepth = 5)

Variables actually used in tree construction:
[1] an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

       CP nsplit rel error  xerror    xstd
1 0.60000      0  1.000000 1.00000 0.21693
2 0.20000      1  0.400000 0.40000 0.15339
3 0.13333      2  0.200000 0.46667 0.16383
4 0.00000      3  0.066667 0.40000 0.15339
  # plot the decision tree
  rpart.plot(rtree_4, main = "Classification Tree for fedPapers85", extra = 102)

  rsq.rpart(rtree_4) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 4, maxdepth = 5)

Variables actually used in tree construction:
[1] an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

       CP nsplit rel error  xerror    xstd
1 0.60000      0  1.000000 1.00000 0.21693
2 0.20000      1  0.400000 0.40000 0.15339
3 0.13333      2  0.200000 0.46667 0.16383
4 0.00000      3  0.066667 0.40000 0.15339
may not be applicable for this method
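Rather than varying `minsplit` and `maxdepth` by hand across chunks, the caret package (already loaded above) can tune the complexity parameter by cross-validation. A sketch, assuming `train_data` from Section 1; `tuneLength = 10` and the seed are arbitrary choices, not values from this notebook:

```r
  # 10-fold cross-validation over a grid of 10 candidate cp values
  set.seed(123)
  ctrl <- trainControl(method = "cv", number = 10)
  dt_cv <- train(author ~ ., data = train_data, method = "rpart",
                 trControl = ctrl, tuneLength = 10)
  dt_cv$bestTune # cp value with the highest cross-validated accuracy
```

With `method = "rpart"`, caret tunes only `cp`, so this complements rather than replaces the manual `minsplit`/`maxdepth` experiments above.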

 # grow tree  with cp=0 , minsplit = 3, maxdepth = 5, minbucket = 1
  rtree_5 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 3, maxdepth = 5, minbucket = 1)

  # summarize the fitted tree
  summary(rtree_5)
Call:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1)
  n= 51 

          CP nsplit  rel error    xerror      xstd
1 0.60000000      0 1.00000000 1.0000000 0.2169305
2 0.20000000      1 0.40000000 0.4000000 0.1533930
3 0.13333333      2 0.20000000 0.5333333 0.1731422
4 0.06666667      3 0.06666667 0.6000000 0.1814970
5 0.00000000      4 0.00000000 0.5333333 0.1731422

Variable importance
 upon    an there   and    on    to    no which  been every   if.    it   not    of     a 
   15    13    12    12    10    10     5     5     4     4     2     2     2     2     1 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations,    complexity param=0.2
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  left son=6 (6 obs) right son=7 (10 obs)
  Primary splits:
      no    < 0.021  to the left,  improve=5.208333, (0 missing)
      an    < 0.046  to the left,  improve=4.656818, (0 missing)
      which < 0.113  to the left,  improve=4.256818, (0 missing)
      of    < 0.7305 to the left,  improve=3.951923, (0 missing)
      the   < 1.019  to the left,  improve=3.951923, (0 missing)
  Surrogate splits:
      an    < 0.046  to the left,  agree=0.938, adj=0.833, (0 split)
      which < 0.113  to the left,  agree=0.938, adj=0.833, (0 split)
      and   < 0.467  to the right, agree=0.875, adj=0.667, (0 split)
      been  < 0.0275 to the left,  agree=0.875, adj=0.667, (0 split)
      every < 0.0125 to the left,  agree=0.875, adj=0.667, (0 split)

Node number 6: 6 observations,    complexity param=0.1333333
  predicted class=Jay       expected loss=0.5  P(node) =0.1176471
    class counts:     0     1     2     3     0
   probabilities: 0.000 0.167 0.333 0.500 0.000 
  left son=12 (3 obs) right son=13 (3 obs)
  Primary splits:
      an  < 0.032  to the right, improve=2.333333, (0 missing)
      and < 0.5665 to the left,  improve=2.333333, (0 missing)
      if. < 0.0175 to the left,  improve=2.333333, (0 missing)
      it  < 0.17   to the left,  improve=2.333333, (0 missing)
      not < 0.0695 to the left,  improve=2.333333, (0 missing)
  Surrogate splits:
      and < 0.5665 to the left,  agree=1, adj=1, (0 split)
      if. < 0.0175 to the left,  agree=1, adj=1, (0 split)
      it  < 0.17   to the left,  agree=1, adj=1, (0 split)
      not < 0.0695 to the left,  agree=1, adj=1, (0 split)
      of  < 0.7305 to the right, agree=1, adj=1, (0 split)

Node number 7: 10 observations
  predicted class=Madison   expected loss=0  P(node) =0.1960784
    class counts:     0     0     0     0    10
   probabilities: 0.000 0.000 0.000 0.000 1.000 

Node number 12: 3 observations,    complexity param=0.06666667
  predicted class=HM        expected loss=0.3333333  P(node) =0.05882353
    class counts:     0     1     2     0     0
   probabilities: 0.000 0.333 0.667 0.000 0.000 
  left son=24 (1 obs) right son=25 (2 obs)
  Primary splits:
      a   < 0.279  to the right, improve=1.333333, (0 missing)
      all < 0.0385 to the left,  improve=1.333333, (0 missing)
      an  < 0.0485 to the right, improve=1.333333, (0 missing)
      and < 0.4115 to the left,  improve=1.333333, (0 missing)
      any < 0.0215 to the right, improve=1.333333, (0 missing)

Node number 13: 3 observations
  predicted class=Jay       expected loss=0  P(node) =0.05882353
    class counts:     0     0     0     3     0
   probabilities: 0.000 0.000 0.000 1.000 0.000 

Node number 24: 1 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.01960784
    class counts:     0     1     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 25: 2 observations
  predicted class=HM        expected loss=0  P(node) =0.03921569
    class counts:     0     0     2     0     0
   probabilities: 0.000 0.000 1.000 0.000 0.000 
  plotcp(rtree_5) # plot cross-validation results

  printcp(rtree_5) # print the cross-validation results table

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.53333 0.17314
4 0.066667      3  0.066667 0.60000 0.18150
5 0.000000      4  0.000000 0.53333 0.17314
  # plot the decision tree
  rpart.plot(rtree_5, main = "Classification Tree for fedPapers85", extra = 102)

  rsq.rpart(rtree_5) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 3, maxdepth = 5, minbucket = 1)

Variables actually used in tree construction:
[1] a    an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

        CP nsplit rel error  xerror    xstd
1 0.600000      0  1.000000 1.00000 0.21693
2 0.200000      1  0.400000 0.40000 0.15339
3 0.133333      2  0.200000 0.53333 0.17314
4 0.066667      3  0.066667 0.60000 0.18150
5 0.000000      4  0.000000 0.53333 0.17314
may not be applicable for this method

  # grow tree with cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3)
  rtree_10 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 4, maxdepth = 3,minbucket = round(5/3))

  # summarize the fitted tree
  summary(rtree_10)
Call:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3))
  n= 51 

         CP nsplit  rel error xerror      xstd
1 0.6000000      0 1.00000000    1.0 0.2169305
2 0.2000000      1 0.40000000    0.4 0.1533930
3 0.1333333      2 0.20000000    0.4 0.1533930
4 0.0000000      3 0.06666667    0.4 0.1533930

Variable importance
 upon    an there   and    on    to    no which  been every   if.    it   not    of 
   15    13    13    12    10    10     6     5     4     4     2     2     2     2 

Node number 1: 51 observations,    complexity param=0.6
  predicted class=Hamilton  expected loss=0.2941176  P(node) =1
    class counts:     0    36     2     3    10
   probabilities: 0.000 0.706 0.039 0.059 0.196 
  left son=2 (35 obs) right son=3 (16 obs)
  Primary splits:
      upon  < 0.0145 to the right, improve=14.497550, (0 missing)
      on    < 0.0915 to the left,  improve=12.450600, (0 missing)
      there < 0.0145 to the right, improve=11.162020, (0 missing)
      to    < 0.499  to the right, improve= 9.435049, (0 missing)
      of    < 0.8705 to the right, improve= 4.936256, (0 missing)
  Surrogate splits:
      there < 0.0145 to the right, agree=0.941, adj=0.812, (0 split)
      on    < 0.0915 to the left,  agree=0.882, adj=0.625, (0 split)
      to    < 0.474  to the right, agree=0.882, adj=0.625, (0 split)
      an    < 0.064  to the right, agree=0.804, adj=0.375, (0 split)
      and   < 0.421  to the left,  agree=0.804, adj=0.375, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6862745
    class counts:     0    35     0     0     0
   probabilities: 0.000 1.000 0.000 0.000 0.000 

Node number 3: 16 observations,    complexity param=0.2
  predicted class=Madison   expected loss=0.375  P(node) =0.3137255
    class counts:     0     1     2     3    10
   probabilities: 0.000 0.062 0.125 0.188 0.625 
  left son=6 (6 obs) right son=7 (10 obs)
  Primary splits:
      no    < 0.021  to the left,  improve=5.208333, (0 missing)
      an    < 0.046  to the left,  improve=4.656818, (0 missing)
      which < 0.113  to the left,  improve=4.256818, (0 missing)
      of    < 0.7305 to the left,  improve=3.951923, (0 missing)
      the   < 1.019  to the left,  improve=3.951923, (0 missing)
  Surrogate splits:
      an    < 0.046  to the left,  agree=0.938, adj=0.833, (0 split)
      which < 0.113  to the left,  agree=0.938, adj=0.833, (0 split)
      and   < 0.467  to the right, agree=0.875, adj=0.667, (0 split)
      been  < 0.0275 to the left,  agree=0.875, adj=0.667, (0 split)
      every < 0.0125 to the left,  agree=0.875, adj=0.667, (0 split)

Node number 6: 6 observations,    complexity param=0.1333333
  predicted class=Jay       expected loss=0.5  P(node) =0.1176471
    class counts:     0     1     2     3     0
   probabilities: 0.000 0.167 0.333 0.500 0.000 
  left son=12 (3 obs) right son=13 (3 obs)
  Primary splits:
      an  < 0.032  to the right, improve=2.333333, (0 missing)
      and < 0.5665 to the left,  improve=2.333333, (0 missing)
      if. < 0.0175 to the left,  improve=2.333333, (0 missing)
      it  < 0.17   to the left,  improve=2.333333, (0 missing)
      not < 0.0695 to the left,  improve=2.333333, (0 missing)
  Surrogate splits:
      and < 0.5665 to the left,  agree=1, adj=1, (0 split)
      if. < 0.0175 to the left,  agree=1, adj=1, (0 split)
      it  < 0.17   to the left,  agree=1, adj=1, (0 split)
      not < 0.0695 to the left,  agree=1, adj=1, (0 split)
      of  < 0.7305 to the right, agree=1, adj=1, (0 split)

Node number 7: 10 observations
  predicted class=Madison   expected loss=0  P(node) =0.1960784
    class counts:     0     0     0     0    10
   probabilities: 0.000 0.000 0.000 0.000 1.000 

Node number 12: 3 observations
  predicted class=HM        expected loss=0.3333333  P(node) =0.05882353
    class counts:     0     1     2     0     0
   probabilities: 0.000 0.333 0.667 0.000 0.000 

Node number 13: 3 observations
  predicted class=Jay       expected loss=0  P(node) =0.05882353
    class counts:     0     0     0     3     0
   probabilities: 0.000 0.000 0.000 1.000 0.000 
  plotcp(rtree_10) # plot cross-validation results

  printcp(rtree_10) # print the cp table of cross-validation results

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3))

Variables actually used in tree construction:
[1] an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

       CP nsplit rel error xerror    xstd
1 0.60000      0  1.000000    1.0 0.21693
2 0.20000      1  0.400000    0.4 0.15339
3 0.13333      2  0.200000    0.4 0.15339
4 0.00000      3  0.066667    0.4 0.15339
  # Plot the tuned decision tree
  rpart.plot(rtree_10, main = "Classification Tree for fedPapers85", extra = 102) # plot decision tree

  rsq.rpart(rtree_10) # plot approximate R-squared and relative error for different splits (2 plots)

Classification tree:
rpart(formula = author ~ ., data = train_data, method = "class", 
    cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3))

Variables actually used in tree construction:
[1] an   no   upon

Root node error: 15/51 = 0.29412

n= 51 

       CP nsplit rel error xerror    xstd
1 0.60000      0  1.000000    1.0 0.21693
2 0.20000      1  0.400000    0.4 0.15339
3 0.13333      2  0.200000    0.4 0.15339
4 0.00000      3  0.066667    0.4 0.15339
Warning: may not be applicable for this method (rsq.rpart is intended for anova-method trees, so the R-squared plot is only approximate for classification)
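Instead of hand-picking `minsplit` and `maxdepth` values one at a time, the cp parameter can also be tuned with a cross-validated grid search via caret's `train`. A sketch of that approach, again using `iris` as a stand-in for `train_data` (`Species` plays the role of `author`):

```r
library(caret)
set.seed(707)

# 10-fold cross-validated grid search over the complexity parameter cp;
# train() refits rpart at each grid point and keeps the best by accuracy
ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(cp = seq(0, 0.3, by = 0.05))
cv_fit <- train(Species ~ ., data = iris, method = "rpart",
                trControl = ctrl, tuneGrid = grid)
cv_fit$bestTune
```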

  
  cat("\nArticles by Author:")

Articles by Author:
  table(fedPapersDF_authors$author)

   dispt Hamilton       HM      Jay  Madison 
       0       51        3        5       15 
  cat("\nTrain_data - Articles by Author:")

Train_data - Articles by Author:
  table(train_data$author)

   dispt Hamilton       HM      Jay  Madison 
       0       36        2        3       10 
  cat("\nTest_data - Articles by Author:")

Test_data - Articles by Author:
  table(test_data$author)

   dispt Hamilton       HM      Jay  Madison 
       0       15        1        2        5 
  
  predict_unseen <- predict(rtree_10, test_data, type = 'class')
  # predict_unseen
  table_mat <- table(test_data$author, predict_unseen)
  cat("\n\nPrediction results : Confusion Matrix \n\n")


Prediction results : Confusion Matrix 
  # table_mat
  confusionMatrix(table_mat)
Confusion Matrix and Statistics

          predict_unseen
           dispt Hamilton HM Jay Madison
  dispt        0        0  0   0       0
  Hamilton     0       15  0   0       0
  HM           0        0  0   0       1
  Jay          0        0  0   1       1
  Madison      0        1  1   0       3

Overall Statistics
                                          
               Accuracy : 0.8261          
                 95% CI : (0.6122, 0.9505)
    No Information Rate : 0.6957          
    P-Value [Acc > NIR] : 0.1262          
                                          
                  Kappa : 0.6475          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity                    NA          0.9375   0.00000    1.00000         0.6000
Specificity                     1          1.0000   0.95455    0.95455         0.8889
Pos Pred Value                 NA          1.0000   0.00000    0.50000         0.6000
Neg Pred Value                 NA          0.8750   0.95455    1.00000         0.8889
Prevalence                      0          0.6957   0.04348    0.04348         0.2174
Detection Rate                  0          0.6522   0.00000    0.04348         0.1304
Detection Prevalence            0          0.6522   0.04348    0.08696         0.2174
Balanced Accuracy              NA          0.9688   0.47727    0.97727         0.7444
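The 0.8261 accuracy that confusionMatrix reports is simply the trace of the confusion table over its total. As an illustration (not the full 5-class table above), a toy 2x2 table restricted to the Hamilton/Madison cells:

```r
# Overall accuracy = sum of diagonal (correct) counts / total count;
# rows are actual authors, columns are predicted authors
tab <- matrix(c(15, 0,
                 1, 3), nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("Hamilton", "Madison"),
                              predicted = c("Hamilton", "Madison")))
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 4)
```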
  
  
  
# Section 3: Prediction | train data

  predict_unseen <- predict(rtree_0, train_data, type = 'class')
  # predict_unseen
  table_mat <- table(train_data$author, predict_unseen)
  cat("\n\nPrediction results : Confusion Matrix \n\n")


Prediction results : Confusion Matrix 
  # table_mat
  confusionMatrix(table_mat)
Confusion Matrix and Statistics

          predict_unseen
           dispt Hamilton HM Jay Madison
  dispt        0        0  0   0       0
  Hamilton     0       36  0   0       0
  HM           0        0  2   0       0
  Jay          0        0  0   3       0
  Madison      0        0  0   0      10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9302, 1)
    No Information Rate : 0.7059     
    P-Value [Acc > NIR] : 1.929e-08  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity                    NA          1.0000   1.00000    1.00000         1.0000
Specificity                     1          1.0000   1.00000    1.00000         1.0000
Pos Pred Value                 NA          1.0000   1.00000    1.00000         1.0000
Neg Pred Value                 NA          1.0000   1.00000    1.00000         1.0000
Prevalence                      0          0.7059   0.03922    0.05882         0.1961
Detection Rate                  0          0.7059   0.03922    0.05882         0.1961
Detection Prevalence            0          0.7059   0.03922    0.05882         0.1961
Balanced Accuracy              NA          1.0000   1.00000    1.00000         1.0000
# Section 3: Prediction | Test Data | rtree_0

  predict_DT0 <- predict(rtree_0, test_data, type = 'class')
  # predict_unseen
  table_DT0 <- table(test_data$author, predict_DT0)
  cat("\n\nPrediction results : Confusion Matrix \n\n")


Prediction results : Confusion Matrix 
  # table_mat
  confusionMatrix(table_DT0)
Confusion Matrix and Statistics

          predict_DT0
           dispt Hamilton HM Jay Madison
  dispt        0        0  0   0       0
  Hamilton     0       15  0   0       0
  HM           0        0  0   0       1
  Jay          0        0  0   1       1
  Madison      0        2  0   0       3

Overall Statistics
                                          
               Accuracy : 0.8261          
                 95% CI : (0.6122, 0.9505)
    No Information Rate : 0.7391          
    P-Value [Acc > NIR] : 0.2447          
                                          
                  Kappa : 0.6275          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity                    NA          0.8824        NA    1.00000         0.6000
Specificity                     1          1.0000   0.95652    0.95455         0.8889
Pos Pred Value                 NA          1.0000        NA    0.50000         0.6000
Neg Pred Value                 NA          0.7500        NA    1.00000         0.8889
Prevalence                      0          0.7391   0.00000    0.04348         0.2174
Detection Rate                  0          0.6522   0.00000    0.04348         0.1304
Detection Prevalence            0          0.6522   0.04348    0.08696         0.2174
Balanced Accuracy              NA          0.9412        NA    0.97727         0.7444
# Section 3: Prediction | Test Data | rtree_1

  predict_DT1 <- predict(rtree_1, test_data, type = 'class')
  # predict_unseen
  table_DT1 <- table(test_data$author, predict_DT1)
  cat("\n\nPrediction results : Confusion Matrix \n\n")


Prediction results : Confusion Matrix 
  # table_mat
  confusionMatrix(table_DT1)
Confusion Matrix and Statistics

          predict_DT1
           dispt Hamilton HM Jay Madison
  dispt        0        0  0   0       0
  Hamilton     0       15  0   0       0
  HM           0        0  0   0       1
  Jay          0        0  0   1       1
  Madison      0        2  0   0       3

Overall Statistics
                                          
               Accuracy : 0.8261          
                 95% CI : (0.6122, 0.9505)
    No Information Rate : 0.7391          
    P-Value [Acc > NIR] : 0.2447          
                                          
                  Kappa : 0.6275          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity                    NA          0.8824        NA    1.00000         0.6000
Specificity                     1          1.0000   0.95652    0.95455         0.8889
Pos Pred Value                 NA          1.0000        NA    0.50000         0.6000
Neg Pred Value                 NA          0.7500        NA    1.00000         0.8889
Prevalence                      0          0.7391   0.00000    0.04348         0.2174
Detection Rate                  0          0.6522   0.00000    0.04348         0.1304
Detection Prevalence            0          0.6522   0.04348    0.08696         0.2174
Balanced Accuracy              NA          0.9412        NA    0.97727         0.7444
# Section 3: Prediction | Test Data | rtree_2

  predict_DT2 <- predict(rtree_2, test_data, type = 'class')
  # predict_unseen
  table_DT2 <- table(test_data$author, predict_DT2)
  cat("\n\nPrediction results : Confusion Matrix \n\n")


Prediction results : Confusion Matrix 
  # table_mat
  confusionMatrix(table_DT2)
Confusion Matrix and Statistics

          predict_DT2
           dispt Hamilton HM Jay Madison
  dispt        0        0  0   0       0
  Hamilton     0       15  0   0       0
  HM           0        0  0   0       1
  Jay          0        0  0   1       1
  Madison      0        2  0   0       3

Overall Statistics
                                          
               Accuracy : 0.8261          
                 95% CI : (0.6122, 0.9505)
    No Information Rate : 0.7391          
    P-Value [Acc > NIR] : 0.2447          
                                          
                  Kappa : 0.6275          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity                    NA          0.8824        NA    1.00000         0.6000
Specificity                     1          1.0000   0.95652    0.95455         0.8889
Pos Pred Value                 NA          1.0000        NA    0.50000         0.6000
Neg Pred Value                 NA          0.7500        NA    1.00000         0.8889
Prevalence                      0          0.7391   0.00000    0.04348         0.2174
Detection Rate                  0          0.6522   0.00000    0.04348         0.1304
Detection Prevalence            0          0.6522   0.04348    0.08696         0.2174
Balanced Accuracy              NA          0.9412        NA    0.97727         0.7444
# Section 3: Prediction | Test Data | rtree_5

  predict_DT5 <- predict(rtree_5, test_data, type = 'class')
  # predict_unseen
  table_DT5 <- table(test_data$author, predict_DT5)
  cat("\n\nPrediction results : Confusion Matrix \n\n")


Prediction results : Confusion Matrix 
  # table_mat
  confusionMatrix(table_DT5)
Confusion Matrix and Statistics

          predict_DT5
           dispt Hamilton HM Jay Madison
  dispt        0        0  0   0       0
  Hamilton     0       15  0   0       0
  HM           0        0  0   0       1
  Jay          0        0  0   1       1
  Madison      0        2  0   0       3

Overall Statistics
                                          
               Accuracy : 0.8261          
                 95% CI : (0.6122, 0.9505)
    No Information Rate : 0.7391          
    P-Value [Acc > NIR] : 0.2447          
                                          
                  Kappa : 0.6275          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity                    NA          0.8824        NA    1.00000         0.6000
Specificity                     1          1.0000   0.95652    0.95455         0.8889
Pos Pred Value                 NA          1.0000        NA    0.50000         0.6000
Neg Pred Value                 NA          0.7500        NA    1.00000         0.8889
Prevalence                      0          0.7391   0.00000    0.04348         0.2174
Detection Rate                  0          0.6522   0.00000    0.04348         0.1304
Detection Prevalence            0          0.6522   0.04348    0.08696         0.2174
Balanced Accuracy              NA          0.9412        NA    0.97727         0.7444
# Section 3: Prediction | Test Data | rtree_10

  predict_DT10 <- predict(rtree_10, test_data, type = 'class')
  # predict_unseen
  table_DT10 <- table(test_data$author, predict_DT10)
  cat("\n\nPrediction results : Confusion Matrix \n\n")


Prediction results : Confusion Matrix 
  # table_mat
  confusionMatrix(table_DT10)
Confusion Matrix and Statistics

          predict_DT10
           dispt Hamilton HM Jay Madison
  dispt        0        0  0   0       0
  Hamilton     0       15  0   0       0
  HM           0        0  0   0       1
  Jay          0        0  0   1       1
  Madison      0        1  1   0       3

Overall Statistics
                                          
               Accuracy : 0.8261          
                 95% CI : (0.6122, 0.9505)
    No Information Rate : 0.6957          
    P-Value [Acc > NIR] : 0.1262          
                                          
                  Kappa : 0.6475          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: dispt Class: Hamilton Class: HM Class: Jay Class: Madison
Sensitivity                    NA          0.9375   0.00000    1.00000         0.6000
Specificity                     1          1.0000   0.95455    0.95455         0.8889
Pos Pred Value                 NA          1.0000   0.00000    0.50000         0.6000
Neg Pred Value                 NA          0.8750   0.95455    1.00000         0.8889
Prevalence                      0          0.6957   0.04348    0.04348         0.2174
Detection Rate                  0          0.6522   0.00000    0.04348         0.1304
Detection Prevalence            0          0.6522   0.04348    0.08696         0.2174
Balanced Accuracy              NA          0.9688   0.47727    0.97727         0.7444
# Section 3: Prediction | Disputed Data
  cat("\nDisputed Articles by Author:")

Disputed Articles by Author:
  table(fedPapersDF_Dispt$author)

   dispt Hamilton       HM      Jay  Madison 
      11        0        0        0        0 
  predict_final <- predict(rtree_5, fedPapersDF_Dispt, type = 'class')
  table_final <- table(fedPapersDF_Dispt$author, predict_final)
  cat("\n\nPrediction results : \n\n")


Prediction results : 
  table_final 
          predict_final
           dispt Hamilton HM Jay Madison
  dispt        0        0  2   1       8
  Hamilton     0        0  0   0       0
  HM           0        0  0   0       0
  Jay          0        0  0   0       0
  Madison      0        0  0   0       0
predict_finaldf <- data.frame(predict_final)
  cat("\n\nPrediction results by article : \n\n")


Prediction results by article : 
  View(predict_finaldf)
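Rather than inspecting `predict_finaldf` interactively with `View`, the disputed-paper verdict can be summarized in the report itself. The sketch below reconstructs the counts shown in `table_final` (8 Madison, 2 HM, 1 Jay) as a labeled factor and tabulates them; it is an illustration of the summary step, not the original prediction object:

```r
# Tally the predicted authors of the 11 disputed papers
predict_final <- factor(c(rep("Madison", 8), rep("HM", 2), "Jay"),
                        levels = c("dispt", "Hamilton", "HM", "Jay", "Madison"))
table(predict_final)                       # counts per predicted author
round(prop.table(table(predict_final)), 2) # share per predicted author
```

With 8 of 11 disputed papers assigned to Madison (and none to Hamilton), the DT model broadly agrees with the clustering result from last week.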
# Random Forest prediction of fedPapersDF1 data
  EnsurePackage("randomForest")

  # View(fedPapersDF1)
  cat("\n All Articles by Author:")

 All Articles by Author:
  table(fedPapersDF$author)

   dispt Hamilton       HM      Jay  Madison 
      11       51        3        5       15 
    
  fit <- randomForest(y = fedPapersDF1$author, x = fedPapersDF1[2:ncol(fedPapersDF1)],
                      data = fedPapersDF1, ntree = 100, keep.forest = FALSE, importance = TRUE)
  print(fit) # view results

Call:
 randomForest(x = fedPapersDF1[2:ncol(fedPapersDF1)], y = fedPapersDF1$author,      ntree = 100, importance = TRUE, keep.forest = FALSE, data = fedPapersDF1) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 8

        OOB estimate of  error rate: 20%
Confusion matrix:
         dispt Hamilton HM Jay Madison class.error
dispt        3        3  0   0       5   0.7272727
Hamilton     0       51  0   0       0   0.0000000
HM           1        0  0   0       2   1.0000000
Jay          2        1  0   1       1   0.8000000
Madison      1        1  0   0      13   0.1333333
  importance(fit) # importance of each predictor
               dispt    Hamilton         HM        Jay    Madison MeanDecreaseAccuracy MeanDecreaseGini
a       7.295406e-01  1.31389708  1.4285714  0.0000000  0.4968289           1.85552028       0.65119316
all     0.000000e+00  0.44985090  0.0000000  1.0050378  1.0050378           1.04802422       0.32462481
also    8.047741e-01 -0.28408957  0.0000000  1.3538810  1.4310476           1.16121998       1.08802976
an      1.823067e+00  3.10344808  0.0000000  1.9483241  1.9834264           3.92437078       1.53432189
and     1.942938e-01  2.75525053  0.0000000  3.0151134  1.5900315           3.52183438       2.30345199
any    -1.690309e+00  1.00618490  0.0000000  1.0050378  1.0665287           0.73366470       0.93553176
are     1.158490e+00 -1.00503782  0.0000000 -1.0050378  0.0000000          -0.01679945       0.26907343
as      1.688027e+00 -0.10606781  1.3538810  0.0000000 -0.5598318           1.50888983       0.65918280
at      1.005038e+00  1.18390806  0.0000000  0.0000000 -1.1280610           1.11018891       0.44831523
be      2.204007e+00 -0.05825098  1.0050378  0.0000000  1.0043544           2.15640466       0.76631605
been    1.060057e-01 -0.77793147  1.7586311  0.0000000 -0.6137462           0.02940495       0.90546440
but    -1.005038e+00 -0.86225174  0.0000000  0.0000000 -0.6033810          -1.25447538       0.24152540
by      2.982350e+00  3.00931669  0.0000000 -1.1915865  1.4249075           3.32753972       2.35737101
can    -5.579525e-01  1.42104221  0.0000000 -1.0050378  0.3630434           0.88289566       0.53384633
do     -1.005038e+00 -1.00503782  0.0000000  0.0000000  1.4242781          -0.02244506       0.26216153
down    0.000000e+00  0.00000000  0.0000000  0.0000000  0.0000000           0.00000000       0.00000000
even   -5.783149e-01  0.00000000  0.0000000  0.0000000 -1.7541160          -1.58671486       0.28950924
every   1.493480e+00  1.43055953  1.4285714  0.0000000  1.2761821           2.60576627       1.14529592
for.    0.000000e+00  1.00503782  0.0000000  0.0000000 -1.0050378          -0.02318078       0.20101329
from   -3.431991e-01 -0.31074635  0.0000000 -1.4002801 -0.5509390          -1.75441798       0.35806619
had     2.000400e-01  1.00503782  0.0000000  0.0000000 -1.0050378           0.30188737       0.22509336
has     2.327930e+00  0.03553427  0.0000000 -1.5249857  0.4816182           1.38576798       0.98691512
have    1.366722e+00 -0.42620706  0.0000000  1.0050378  0.3888079           0.88268208       0.40141251
her    -1.005038e+00  1.00503782  0.0000000  0.0000000 -1.0050378          -1.00503782       0.15280952
his    -1.280474e-01  1.00503782  0.0000000  0.0000000  0.4690364           1.05980251       0.23800000
if.    -8.409316e-01  0.06645958  0.0000000  0.0000000  1.0050378          -0.26595316       0.31971927
in.     3.345124e-01  1.01717357  1.0050378 -1.4002801  0.3872553           0.69230284       1.34079432
into    1.005038e+00  0.88379737  0.0000000  1.0050378 -1.0050378           0.96069733       0.30674774
is      2.627035e-01  1.00503782  1.0050378  1.0050378 -0.8322688           0.49841441       0.56641393
it     -7.942911e-01 -1.00503782  0.0000000  0.0000000 -1.2265140          -1.40474512       0.26183009
its     1.005038e+00  0.08304834 -1.0050378 -1.0050378  1.9137626           1.78220139       0.54259654
may     0.000000e+00  1.00503782  0.0000000 -1.0050378 -0.1561928           0.09944180       0.27158956
more   -2.325581e-01  0.24525400  0.0000000  0.0000000 -0.1084716           0.25574559       0.31549911
must    0.000000e+00  1.00503782  1.0050378  0.0000000  1.4242781           1.65078685       0.42001885
my      0.000000e+00  1.00503782  0.0000000  0.0000000 -1.7216897          -0.44743039       0.09682012
no      9.615735e-03  0.07369462 -0.4476615 -1.0050378  0.9879329           0.38198970       0.84016672
not     1.151825e+00 -1.03196684  1.3538810 -0.2774568 -1.3244845          -0.07093695       0.60606044
now     0.000000e+00 -1.00503782  0.0000000  0.0000000  0.0000000          -1.00503782       0.13929625
of     -9.921184e-01  0.68021873  0.0000000  2.4722853  0.8410479           1.53280125       1.28579447
on      1.451258e+00  4.36342544 -1.0050378 -1.5911978  2.1188560           3.81586034       2.79284944
one     4.267896e-01  1.42427806 -1.0050378  1.0050378  1.0050378           1.09430599       0.52074431
only   -1.005038e+00 -0.12775114  0.0000000  0.4476615  1.0050378          -0.07056397       0.52523582
or      0.000000e+00  1.00503782  0.0000000  0.0000000 -0.7558156           0.11797550       0.28375323
our     0.000000e+00  0.06501857  0.0000000  0.0000000 -0.1061236          -0.36920322       0.35557975
shall   1.749453e+00  0.65901824  0.0000000 -1.0050378  0.8675606           1.60240930       0.57966806
should -1.413925e+00  0.66654754  1.0050378  0.0000000  0.1289908           0.20633552       0.50907639
so      0.000000e+00  0.64457274  0.0000000  0.0000000  0.0000000           0.74421972       0.33291210
some    1.555910e+00 -0.05058133  0.0000000  0.0000000  1.2803688           1.71918265       0.44643757
such   -1.005038e+00  1.41785354  0.0000000  0.0000000  0.0000000           0.63279500       0.26098388
than    0.000000e+00 -1.42824159  1.0050378  0.0000000 -1.4242781          -1.57908883       0.38626672
that    4.476615e-01 -1.00503782  1.0050378  0.0000000  1.7227589           1.01225856       0.23127042
the     0.000000e+00 -0.09407625  0.0000000  2.1568925  0.8928348           1.83191730       1.20504815
their   1.513820e+00  0.48468813 -1.0050378  1.3538810 -0.2935817           2.12720053       0.64341943
then    0.000000e+00 -1.35388105 -1.0050378  0.0000000  0.0000000          -1.64240316       0.16344570
there   3.550505e-17  3.84019564  1.3538810 -0.2294761  3.7006005           4.05865516       2.86178759
things  0.000000e+00  0.00000000  0.0000000  0.0000000  0.0000000           0.00000000       0.01000000
this    0.000000e+00 -1.00503782  0.0000000  1.0050378  1.0050378           0.02175461       0.45741485
to     -8.608709e-01  3.35314254  1.0050378  0.3335187  1.4082016           3.98456227       2.47379188
up     -1.005038e+00  1.00503782  0.0000000  0.0000000  0.2722664           0.57900475       0.18023179
upon    4.665254e+00  7.25278888  1.0050378  2.7522860  4.3080837           7.62654635       5.66827478
was    -8.491993e-01  0.43238897  1.0050378 -1.4139250  2.4075514           1.13102028       0.70107848
were   -1.280474e-01 -1.00503782  0.0000000  0.0000000  0.5490802          -0.61172508       0.34497031
what    1.005038e+00 -1.00503782  0.0000000  0.0000000  0.2461646          -0.01845473       0.21280474
when    1.740777e+00  1.42857143 -1.0050378 -1.0050378  0.0000000           1.25400593       0.34647203
which   0.000000e+00  1.75809244  0.0000000  1.4002801  1.7309693           2.21774005       0.49400968
who    -1.005038e+00  1.00503782  0.0000000  0.0000000  0.0000000           0.48647227       0.24923787
will   -7.464487e-01 -0.02640373  0.0000000  1.0050378  1.1584896           0.02913939       0.65786191
with   -1.005038e+00 -1.74729192  0.0000000  1.0050378 -0.4726493          -1.58066626       0.28930846
would   7.101010e-17  2.05129105  0.0000000 -0.2774568  1.4242781           1.51921803       0.87874498
your    0.000000e+00  0.00000000  0.0000000  0.0000000  0.0000000           0.00000000       0.08685936
  rf_importance <- data.frame(importance(fit)) # importance of each predictor
  rf_importance
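The importance table is easier to read graphically: `varImpPlot` from the randomForest package draws the MeanDecreaseAccuracy and MeanDecreaseGini columns as dot charts. A self-contained sketch, using `iris` as a stand-in for `fedPapersDF1`:

```r
library(randomForest)
set.seed(707)  # forests are randomized; fix the seed for reproducibility

# Fit a small forest and plot both importance measures side by side
rf <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
varImpPlot(rf, main = "Variable importance (MeanDecreaseAccuracy / MeanDecreaseGini)")
```

Applied to the fit above, this would make it immediately visible that "upon", "there", and "on" dominate, consistent with the decision-tree splits.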
# Random Forest prediction of fedPapersDF1 data (second run; forests are randomized, so the OOB error and importance values differ slightly between runs)
    
  fit <- randomForest(y = fedPapersDF1$author, x = fedPapersDF1[2:ncol(fedPapersDF1)],
                      data = fedPapersDF1, ntree = 100, keep.forest = FALSE, importance = TRUE)
  print(fit) # view results

Call:
 randomForest(x = fedPapersDF1[2:ncol(fedPapersDF1)], y = fedPapersDF1$author,      ntree = 100, importance = TRUE, keep.forest = FALSE, data = fedPapersDF1) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 8

        OOB estimate of  error rate: 22.35%
Confusion matrix:
         dispt Hamilton HM Jay Madison class.error
dispt        2        2  0   0       7   0.8181818
Hamilton     0       51  0   0       0   0.0000000
HM           0        0  0   1       2   1.0000000
Jay          1        0  0   3       1   0.4000000
Madison      1        4  0   0      10   0.3333333
  importance(fit) # importance of each predictor
            dispt    Hamilton         HM        Jay     Madison MeanDecreaseAccuracy MeanDecreaseGini
a       0.8141880  2.40158236  1.6903085  1.7586311  0.52979064          2.723332456      1.074961499
all    -1.4285714  1.40028008  0.0000000  0.0000000  1.00503782          0.640097047      0.128500000
also   -0.6637233 -0.27553536  0.0000000  1.6903085 -1.26690431         -0.326233046      0.936142162
an      2.0108422  2.01342693  1.0050378  3.1094985  1.05814822          3.120989957      2.021625066
and    -1.3109902  3.01901653  1.3538810  3.0454787 -0.83891019          3.611830368      1.939410126
any    -0.2000400  1.55060537 -1.0050378 -1.0050378 -0.95836918          0.931891409      0.558205327
are     0.0000000 -1.03381961  0.0000000 -1.0050378  0.13442849         -0.504247075      0.382612923
as      0.2774568  0.35794034  1.4285714  0.0000000  0.57831493          0.864461770      0.609635689
at     -0.1533750  1.73863481  0.0000000  0.0000000  1.49036823          1.756078942      0.629939790
be      1.0050378  1.42380375  2.6926023 -1.0050378  0.86929312          2.415476889      0.831813979
been   -0.3360069  0.33261246  1.0050378  0.9082573  0.99647476          1.114090450      1.090647995
but    -0.2000400  1.50066897  0.0000000  1.0050378  0.08671426          1.422964691      0.440945151
by      0.0407573  2.04283057  0.0000000  0.6415003  0.13117478          1.891153699      1.652448186
can     0.0000000 -1.42857143  0.0000000  0.0000000  1.00503782         -0.295172222      0.399286617
do      0.0000000 -1.00503782 -1.0050378  0.0000000  0.00000000         -1.428362091      0.136002540
down   -1.0050378 -1.00503782  0.0000000  0.0000000 -1.00503782         -1.381317263      0.091571429
even    0.0000000 -1.42324644  0.0000000  0.0000000 -1.35388105         -1.258235639      0.159750000
every   0.0000000  1.54320180  1.0050378  1.0050378 -0.02794308          1.706150428      0.760287372
for.    0.0000000 -1.10944767  1.0050378  0.0000000 -0.79255609         -1.426556688      0.336744298
from    1.7015962 -0.05137261  0.0000000  0.0000000  1.42857143          1.202124003      0.655468678
had     0.0000000  0.00000000  0.0000000  1.0050378 -1.00503782          0.000000000      0.165669448
has     2.9499150  1.67777148 -1.7586311 -0.3335187  2.46510026          2.959424880      1.044080558
have    0.9581896 -0.12396938  0.0000000  0.0000000  0.89842912          0.778576015      0.727342007
her    -1.0050378 -0.14806287  0.0000000  0.0000000 -1.00503782         -1.428312025      0.122704798
his     0.0000000 -0.68728294  0.0000000  0.0000000  1.24580153          0.369637842      0.217884712
if.    -1.3538810  0.00000000  0.0000000  1.3538810 -1.42857143         -0.670831227      0.437049394
in.     1.3538810  0.85262077  1.4285714  1.0050378  0.21989985          1.219069134      0.775348077
into   -0.4476615 -1.12960533  0.0000000 -0.6876142 -1.00503782         -1.051314013      0.298076565
is      0.0000000  0.00000000 -1.4285714  1.5456644  0.30164849          0.933479188      0.315722222
it     -1.0050378  0.00000000 -1.0050378 -1.0050378 -1.00503782         -1.655142963      0.181166763
its    -0.5862104  1.30354589  0.0000000  0.0000000  0.02796735          0.668128833      0.498337173
may    -1.5862879  0.00000000  1.0050378 -1.0050378  1.34096493         -0.026517067      0.484785856
more    1.0050378  0.00000000  0.0000000  1.4285714 -1.00503782          1.073176089      0.276178880
must    1.0050378  1.00503782  0.0000000  0.0000000 -1.00503782          0.566037526      0.160461538
my      0.0000000  0.00000000  0.0000000  0.0000000  1.00503782          1.005037815      0.013333333
no      0.3181603 -2.01353792 -0.3335187  0.0000000  2.42139391          0.469212830      0.790700250
not    -0.3145027  1.42573279  1.0050378  1.4002801  0.29050934          1.106838732      0.808509656
now     1.7494534 -1.00503782 -1.0050378  0.0000000 -0.77693097         -0.409668274      0.293514508
of      0.1280474  2.73665158  0.0000000  3.1448545  1.23390468          3.354598823      1.711806594
on      3.0266908  4.36819343  0.4476615 -1.5911978  2.49389760          4.911140768      2.647717386
one     0.1561928  1.29262860  0.0000000  1.0050378  0.00000000          1.307375739      0.694572602
only    1.4285714  0.69548318  0.0000000  0.0000000 -1.32453236          0.480546683      0.467899952
or     -1.4002801 -0.82119764 -1.0050378 -1.0050378  0.33989333         -0.876655669      0.669918081
our    -1.4196573  1.42056323  0.0000000  1.4002801 -1.00503782          0.005730474      0.211757802
shall   1.4196573  1.00503782  0.0000000  0.0000000 -0.63765607          0.611309253      0.295124657
should  0.5783149  0.99640468  0.0000000  0.0000000  0.05414071          0.820640659      0.511198834
so      0.0000000  0.00000000  0.0000000  0.0000000 -0.20004001         -0.136235624      0.143777778
some    1.2751534  0.69109766  0.0000000  0.0000000  0.00000000          1.253351114      0.288539119
such    0.0000000  1.59992942  0.0000000  0.0000000 -1.35388105          1.047411699      0.188615348
than   -0.8559210  0.00000000  0.0000000  0.0000000 -0.05494227         -0.608422595      0.249236274
that    0.2182699  0.39949082  1.0050378  0.0000000  1.65493073          1.873838035      0.607429354
the     1.3538810  0.42119132  0.0000000  1.5294382  0.87574550          1.428560612      0.952728290
their   1.4285714  0.57528417  1.0050378  0.0000000 -0.78741686          0.862054063      0.400457686
then    0.0000000  1.00503782  0.0000000  0.0000000  0.00000000          1.005037815      0.189033993
there  -1.1020775  3.55783505  0.0000000  0.4476615  4.09896440          4.033428132      2.663423073
things -1.0050378  0.00000000  0.0000000  0.0000000  0.00000000         -1.005037815      0.066515152
this   -1.0050378  1.33160102  0.0000000  0.0000000  1.69725026          1.642999201      0.320695975
to     -0.7699206  3.25289118  1.0050378  0.0000000  2.85686860          3.155368935      3.058179318
up      0.0000000  0.00000000  1.0050378  0.0000000  0.00000000          1.005037815      0.125099778
upon    3.4574111  6.99757420  1.3538810  2.1882490  5.40596069          7.139375344      6.345506204
was    -1.1144180 -1.47463897  1.0050378  0.0000000 -1.00503782         -1.519821970      0.466117037
were   -0.3431991  1.09964909  0.0000000  0.0000000  1.00503782          1.436174623      0.506280580
what    0.0000000  1.00503782  0.0000000  0.0000000  0.00000000          1.005037815      0.111764706
when    1.4087457 -1.00503782  0.0000000 -1.0050378  0.42237411          0.113766554      0.248016106
which  -0.1561928  1.38063475 -1.0050378  1.9660901  1.74945338          2.480267695      0.660858240
who    -0.2000400 -0.87650722  0.0000000  0.0000000  0.00000000         -0.739704410      0.155808537
will   -0.9642111  1.30182796  0.0000000  0.0000000  0.30164849          1.005605556      0.731115521
with   -1.4285714  0.11565542  0.0000000  0.0000000  1.00503782         -0.442704303      0.323980357
would   1.3388584  1.53456170  1.0050378 -0.2182699  0.34492768          1.920871207      0.765781222
your    0.0000000 -1.00503782  0.0000000  0.0000000  0.00000000         -1.005037815      0.008768116
---
title: "R Notebook"
output: html_notebook
---

----------------------------------------------------------------------------------
Title: "IST707 HW5 Use Decision Tree to Solve a Mystery in History"
Name: Sathish Kumar Rajendiran
Date: 08/12/2020
-----------------------------------------------------------------------------------
Exercise: 
Use Decision Tree to Solve a Mystery in History: who wrote the disputed essays, Hamilton or Madison?

In this homework assignment, you are going to use the decision tree algorithm to solve the disputed essay problem. 
Last week you used clustering techniques to tackle this problem.

Organize your report using the following template: 

Section 1: Data preparation
  You will need to separate the original data set into training and testing sets for the classification experiments. 
  Describe which examples are in your training data and which are in your test data.

Section 2: Build and tune decision tree models
  First build a DT model using the default settings, then tune the parameters to see whether a better model can be generated. 
  Compare these models using appropriate evaluation measures. Describe and compare the patterns learned in these models.
  
Section 3: Prediction
  After building the classification model, apply it to the disputed papers to find out the authorship. 
  Does the DT model reach the same conclusion as the clustering algorithms did?

```{r}
# import libraries 
#create a function to ensure the libraries are imported
EnsurePackage <- function(x){
  x <- as.character(x)
    if (!require(x,character.only = TRUE)){
      install.packages(pkgs=x, repos = "http://cran.us.r-project.org")
      require(x, character.only = TRUE)
    }
  }
# usage example, to load the necessary library for further processing...
EnsurePackage("ggplot2")
EnsurePackage("RColorBrewer")
EnsurePackage("NbClust")
EnsurePackage("caret")
EnsurePackage("rpart")
EnsurePackage("rpart.plot")
EnsurePackage("randomForest")
cat("All Packages are available")
```
```{r}
#Load CSV into data frame
  filepath <- "/Users/sathishrajendiran/Documents/R/fedPapers85.csv"
  fedPapersDF <- read.csv(filepath, na.strings=c(""," ","NA"))
  fedPapersDF$author <- as.factor(fedPapersDF$author) # class label must be a factor for rpart/randomForest
  dim(fedPapersDF) # 85 rows, 72 columns
```

```{r}
# Preview the structure 
  str(fedPapersDF)
# Analyze the spread  
  summary(fedPapersDF)
# Preview top few rows  
  head(fedPapersDF)
# compare number of articles by authors
  x <- data.frame(table(fedPapersDF$author))
  coul <- brewer.pal(5, "Set2")
  barplot(height=x$Freq, names=x$Var1, col=coul,xlab="Authors", 
        ylab="Number of Papers", 
        main="FedPapers85 by Authors", 
        ylim=c(0,60))
# view the data
  View(fedPapersDF)
```


```{r}
# Data preparation
  
  # 1. Training Set Preparation

  # Prepare dataframe by removing filename from the list
  fedPapersDF1 <- subset(fedPapersDF,select=-filename)
  fedPapersDF1
  
  set.seed(100)  

  # separate the disputed articles
  fedPapersDF_Dispt <- subset(fedPapersDF1, author=='dispt')
  # fedPapersDF_Dispt
  # separate the non-disputed articles
  fedPapersDF_authors <- subset(fedPapersDF1, author!='dispt')
  fedPapersDF_authors

  # Split the non-disputed articles into training and test datasets.
  # sample_size sets the cut point: 70% of the rows go to training.
  # (test accuracy at other split sizes: 65% -> 80%, 70% -> 82%, 75% -> 78%, 80% -> 70%)
  sample_size = floor(0.70*nrow(fedPapersDF_authors))
  # sample_size # 51 of the 74 non-disputed articles
  # seed was set above (set.seed(100)) for reproducible sampling; seed 324 gave 100% training accuracy
  train_index = sample(seq_len(nrow(fedPapersDF_authors)),size = sample_size)

  train_data = fedPapersDF_authors[train_index,]  # training dataset: rows in train_index
  test_data  = fedPapersDF_authors[-train_index,] # test dataset: the remaining rows
  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)


```
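The random split above does not stratify by author, so a small class such as Jay (only 5 papers) can land mostly in one partition. A hedged alternative, not used in this assignment, is caret's `createDataPartition()`, which samples within each class; a minimal sketch on the built-in iris data as a stand-in for `fedPapersDF_authors`:

```{r}
library(caret)

set.seed(100)
# createDataPartition samples within each level of the class factor,
# so the 70/30 split preserves the class proportions
idx <- createDataPartition(iris$Species, p = 0.70, list = FALSE)
train_strat <- iris[idx, ]
test_strat  <- iris[-idx, ]

table(train_strat$Species) # each species keeps roughly 70% of its rows
table(test_strat$Species)
```

For the real data the analogous call would be `createDataPartition(fedPapersDF_authors$author, p = 0.70, list = FALSE)`.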


```{r}
# Section 2: Build and tune decision tree models
  
  # grow tree
  rtree <- rpart(author~. ,data=train_data, method='class')

  #summarize rtree values
  summary(rtree)
  plotcp(rtree) # plot cross-validation results
  printcp(rtree) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree) # plot approximate R-squared and relative error for different splits (2 plots)

```
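For the model comparison asked for in Section 2, this default tree also needs a test-set score; the prediction chunks below only evaluate the tuned trees. The pattern is the same `predict()`/`confusionMatrix()` pairing used later — a minimal, self-contained sketch on iris (substitute `rtree` and `test_data` for the real run):

```{r}
library(rpart)
library(caret)

set.seed(100)
idx   <- sample(seq_len(nrow(iris)), size = floor(0.70 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# fit with rpart's default settings, then score the held-out rows
fit_default  <- rpart(Species ~ ., data = train, method = "class")
pred_default <- predict(fit_default, test, type = "class")
confusionMatrix(table(test$Species, pred_default)) # accuracy plus per-class statistics
```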
```{r}

  # grow tree with cp = 0, minsplit = 0, maxdepth = 5
  rtree_0 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 0, maxdepth = 5)

  #summarize rtree_0 values
  summary(rtree_0)
  plotcp(rtree_0) # plot cross-validation results
  printcp(rtree_0) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree_0,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree_0) # plot approximate R-squared and relative error for different splits (2 plots)
  
```


```{r}
  # grow tree with cp = 0, minsplit = 1, maxdepth = 5
  rtree_1 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 1, maxdepth = 5)

  #summarize rtree_1 values
  summary(rtree_1)
  plotcp(rtree_1) # plot cross-validation results
  printcp(rtree_1) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree_1,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree_1) # plot approximate R-squared and relative error for different splits (2 plots)
```
```{r}
  # grow tree with cp = 0, minsplit = 2, maxdepth = 10
  rtree_2 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 2, maxdepth = 10)

  #summarize rtree_2 values
  summary(rtree_2)
  plotcp(rtree_2) # plot cross-validation results
  printcp(rtree_2) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree_2,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree_2) # plot approximate R-squared and relative error for different splits (2 plots)
```
```{r}
  # grow tree with cp = 0, minsplit = 3, maxdepth = 5
  rtree_3 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 3, maxdepth = 5)

  #summarize rtree_3 values
  summary(rtree_3)
  plotcp(rtree_3) # plot cross-validation results
  printcp(rtree_3) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree_3,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree_3) # plot approximate R-squared and relative error for different splits (2 plots)
```

```{r}
  # grow tree with cp = 0, minsplit = 4, maxdepth = 5
  rtree_4 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 4, maxdepth = 5)

  #summarize rtree_4 values
  summary(rtree_4)
  plotcp(rtree_4) # plot cross-validation results
  printcp(rtree_4) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree_4,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree_4) # plot approximate R-squared and relative error for different splits (2 plots)
```
```{r}
 # grow tree  with cp=0 , minsplit = 3, maxdepth = 5, minbucket = 1
  rtree_5 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 3, maxdepth = 5, minbucket = 1)

  #summarize rtree values
  summary(rtree_5)
  plotcp(rtree_5) # plot cross-validation results
  printcp(rtree_5) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree_5,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree_5) # plot approximate R-squared and relative error for different splits (2 plots)
```


```{r}
  # grow tree with cp = 0, minsplit = 4, maxdepth = 3, minbucket = round(5/3)
  rtree_10 <- rpart(author~. ,data=train_data, method='class', cp=0, minsplit = 4, maxdepth = 3,minbucket = round(5/3))

  #summarize rtree_10 values
  summary(rtree_10)
  plotcp(rtree_10) # plot cross-validation results
  printcp(rtree_10) # print the complexity-parameter (cp) table

  # Plot tree | lets Plot decision trees
  rpart.plot(rtree_10,main="Classification Tree for fedPapers85", extra= 102) # plot decision tree
  rsq.rpart(rtree_10) # plot approximate R-squared and relative error for different splits (2 plots)
  
  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)
  
  predict_unseen <- predict(rtree_10, test_data, type = 'class')
  # predict_unseen
  table_mat <- table(test_data$author, predict_unseen)
  cat("\n\nPrediction results : Confusion Matrix \n\n")
  # table_mat
  confusionMatrix(table_mat)
  
  
  
```
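The chunks above search `minsplit`/`maxdepth` values by hand. A hedged alternative, not part of the original workflow, is to let caret cross-validate rpart's complexity parameter `cp` over a grid; sketched on iris as a stand-in for `train_data`:

```{r}
library(caret)
library(rpart)

set.seed(100)
ctrl    <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
cp_grid <- expand.grid(cp = seq(0, 0.05, by = 0.01)) # candidate cp values

# train() refits rpart at each cp and keeps the best cross-validated model
fit_cv <- train(Species ~ ., data = iris, method = "rpart",
                trControl = ctrl, tuneGrid = cp_grid)
fit_cv$bestTune # cp with the best cross-validated accuracy
```

With the assignment data, the formula would be `author ~ .` on `train_data` instead.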

```{r}
# Section 3: Prediction | train data

  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)


  predict_unseen <- predict(rtree_0, train_data, type = 'class')
  # predict_unseen
  table_mat <- table(train_data$author, predict_unseen)
  cat("\n\nPrediction results : Confusion Matrix \n\n")
  # table_mat
  confusionMatrix(table_mat)

```
```{r}
# Section 3: Prediction | Test Data

  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)

  predict_DT0 <- predict(rtree_0, test_data, type = 'class')
  # predict_unseen
  table_DT0 <- table(test_data$author, predict_DT0)
  cat("\n\nPrediction results : Confusion Matrix \n\n")
  # table_mat
  confusionMatrix(table_DT0)
```
```{r}
# Section 3: Prediction | Test Data

  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)


  predict_DT1 <- predict(rtree_1, test_data, type = 'class')
  # predict_unseen
  table_DT1 <- table(test_data$author, predict_DT1)
  cat("\n\nPrediction results : Confusion Matrix \n\n")
  # table_mat
  confusionMatrix(table_DT1)
```
```{r}
# Section 3: Prediction | Test Data

  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)


  predict_DT2 <- predict(rtree_2, test_data, type = 'class')
  # predict_unseen
  table_DT2 <- table(test_data$author, predict_DT2)
  cat("\n\nPrediction results : Confusion Matrix \n\n")
  # table_mat
  confusionMatrix(table_DT2)
```

```{r}
# Section 3: Prediction | Test Data

  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)


  predict_DT5 <- predict(rtree_5, test_data, type = 'class')
  # predict_unseen
  table_DT5 <- table(test_data$author, predict_DT5)
  cat("\n\nPrediction results : Confusion Matrix \n\n")
  # table_mat
  confusionMatrix(table_DT5)
```

```{r}
# Section 3: Prediction | Test Data

  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)


  predict_DT10 <- predict(rtree_10, test_data, type = 'class')
  # predict_unseen
  table_DT10 <- table(test_data$author, predict_DT10)
  cat("\n\nPrediction results : Confusion Matrix \n\n")
  # table_mat
  confusionMatrix(table_DT10)
```

```{r}
# Section 3: Prediction | Disputed Data
  cat("\nDisputed Articles by Author:")
  table(fedPapersDF_Dispt$author)
  cat("\nArticles by Author:")
  table(fedPapersDF_authors$author)
  cat("\nTrain_data - Articles by Author:")
  table(train_data$author)
  cat("\nTest_data - Articles by Author:")
  table(test_data$author)
  predict_final <- predict(rtree_5, fedPapersDF_Dispt, type = 'class')
  table_final <- table(fedPapersDF_Dispt$author, predict_final)
  cat("\n\nPrediction results : \n\n")
  table_final 
```


```{r}
predict_finaldf <- data.frame(predict_final)
  cat("\n\nPrediction results by article : \n\n")
  predict_finaldf # predicted author for each disputed paper (View() would not appear in the rendered notebook)
```


```{r}
# Random Forest prediction of fedPapersDF1 data
  EnsurePackage("randomForest")

  # View(fedPapersDF1)
  cat("\n All Articles by Author:")
  table(fedPapersDF$author)
    
  fit <- randomForest(x=fedPapersDF1[2:ncol(fedPapersDF1)], y=fedPapersDF1$author,
                      ntree=100, keep.forest=FALSE, importance=TRUE) # data= is redundant when x and y are supplied
  print(fit) # view results
  importance(fit) # importance of each predictor
```
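One caveat with the fit above: `keep.forest=FALSE` discards the trees, so `predict()` cannot be applied to new rows, and fitting on all of `fedPapersDF1` treats 'dispt' as just another class. To score held-out papers with a forest, the trees must be retained (`keep.forest` defaults to TRUE); a minimal sketch on iris, with a held-out subset standing in for the disputed papers:

```{r}
library(randomForest)

set.seed(100)
idx     <- sample(seq_len(nrow(iris)), 30) # hold out 30 rows as the "disputed" set
known   <- iris[-idx, ]
unknown <- iris[idx, ]

# keep.forest = TRUE (the default) retains the trees so predict() works
rf_fit  <- randomForest(Species ~ ., data = known, ntree = 100, importance = TRUE)
rf_pred <- predict(rf_fit, newdata = unknown)
table(rf_pred) # predicted class counts for the held-out rows
```

For the real data the analogue would be to fit on the non-disputed papers (`fedPapersDF_authors`) and predict on `fedPapersDF_Dispt`.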
```{r}
  rf_importance <- data.frame(importance(fit)) # importance of each predictor
  rf_importance
```


